HK1203678B - Methods and systems for generating green score using data and sentiment analysis - Google Patents
- Publication number
- HK1203678B
- Authority
- HK
- Hong Kong
- Prior art keywords
- green
- score
- computer
- risk
- entity
Description
Technical Field
The present invention relates generally to financial services and to mining information from traditional news sources, new/social media sources, and other content sources to identify sentiment and predict behavior for pricing and recommendations. More specifically, the present invention provides intelligent analytics that enable measuring and/or scoring the "greenness" of companies and associated risk areas, as perceived by traditional and new media, and/or generating a composite "environmental" index, as well as predicting corporate valuation behavior. The present invention provides a dynamic tool that leverages machine learning capabilities, news sentiment expertise, and intelligent analytics to provide services for benchmarking the environmental and sustainability sentiment of private and publicly traded companies.
Background Art
With the advent of the printing press, typesetting, typewriters, computer-implemented word processing, and mass data storage, the amount of information generated by humanity has increased dramatically and at an ever-increasing rate. Recently, less formal content sources, including "social media," have become increasingly prevalent. In contrast to traditional media, which is essentially passive (content is read), social media is more interactive and immediate and often results in faster response or reaction times. As a result of the growth and diversification of information sources, there is a continuing and growing need to collect and store, identify, track, classify, and catalog this growing ocean of information/content, and to process it and deliver value-added services that promote the intelligent use of data and predictive patterns derived from such information. With the development, widespread deployment, and accessibility of high-speed networks such as the Internet, there is a growing need to appropriately and efficiently process the ever-increasing amount of content available on such networks to aid decision-making. In particular, there is a need to quickly process information related to current events so that informed decisions can be made in light of the impact of those events or related sentiment, taking into account the effect such events and sentiment may have on the prices of traded securities or other offerings. The widespread availability of and access to blogs, wikis, forums, chat rooms, and social media has enabled a growing number of information recipients to express opinions about people, companies, governments, and commercial products. Virtually instant and simultaneous access to information can increase the correlation between events and stock prices.
In many fields and industries, including the financial services industry, there are content and enhanced experience providers such as The Thomson Reuters Corporation, Wall Street Journal, Dow Jones News Service, Bloomberg, Financial News, Financial Times, News Corporation, Zawya, and the New York Times. Such providers identify, collect, analyze, and process key data for use in generating content, such as reports and articles, for consumption by professionals and others involved in the corresponding industries (e.g., financial advisors and investors). In one manner of content delivery, these financial news services provide both real-time and archived financial news feeds that include articles and other reports on recent events of interest to investors. Many of these articles and reports, and of course the underlying events, can have a measurable impact on the trading prices of stocks associated with publicly traded companies. Although the discussion herein is often in terms of publicly traded stocks (e.g., those traded on markets such as NASDAQ and the New York Stock Exchange), the present invention is not limited to stocks and includes application to other forms of investments and instruments. Professionals and providers across industries continually seek ways to enhance the content, data, and services they offer to subscribers, clients, and other customers, and to differentiate themselves from the competition. Such providers strive to create and deliver enhanced tools, including search and ranking tools, that enable customers to process information more efficiently and effectively and to make informed decisions.
Advances in technology, including database mining and management, search engines, language recognition, and modeling, have provided increasingly sophisticated methods for searching and processing large volumes of data and documents (e.g., databases of news articles, financial reports, blogs, SEC and other required corporate disclosures, legal decisions, statutes, laws, and regulations) that may affect business performance and, therefore, the prices of stocks, securities, or funds made up of such equities. Investment and other financial professionals, as well as other users, increasingly rely on mathematical models and algorithms to make professional and business decisions. Particularly in the investment field, systems that provide faster access to and processing of (accurate) news and other information related to corporate performance would be highly valuable tools for professionals and would lead to more informed and more successful decision making.
Many financial service providers use "news analysis" or "news analytics" to provide enhanced services to subscribers and customers. "News analysis" or "news analytics" refers to a broad field encompassing and involving information retrieval, machine learning, statistical learning theory, network theory, and collaborative filtering. News analytics includes a set of techniques, formulas, statistics, and related tools and metrics used to digest, summarize, classify, and otherwise analyze information sources (often public "news" information). An exemplary use of news analytics is a system that digests (i.e., reads and classifies) financial information to determine the market impact associated with such information while normalizing the data for other effects. News analysis refers to measuring and analyzing various qualitative and quantitative attributes of textual news stories, such as those appearing in formal, text-based articles and in less formal delivery forms such as blogs and other online vehicles. More particularly, the present invention concerns analysis in the context of electronic content. The attributes include sentiment, relevance, and novelty. Expressing or representing news stories as "numbers" or other data points enables a system to transform traditional expressions of information into mathematical and statistical expressions that are more readily analyzed. News analysis techniques and metrics can be used in a financial context, and more particularly in the context of past and predictive investment performance.
News analytics systems can be used to measure and predict the following: earnings, stock valuations, and market volatility; reversals of news impact; the relationship between news and message board information; the relevance of risk-related words in annual reports for predicting negative returns; sentiment; the impact of news stories on stock returns; and the impact of optimism and pessimism in news on earnings. News analytics can be viewed at three levels or layers: text, content, and context. Much effort has focused on the first layer, text, in which text-based engines/applications process the raw textual components of news, i.e., words, phrases, document titles, and so on. Text can be converted or leveraged into additional information, and irrelevant text can be discarded, condensing it into information of higher relevance/usefulness. The second layer, content, represents the enrichment of text with higher meaning and significance, for example attributes of quality and veracity, which analytics can exploit further. Text can be divided into "fact" or "opinion" expressions. The third layer, context, refers to the connectedness or relationships between information items. Context can also refer to the network relationships of news. For example, the Das and Sisk (2005) article examines the social networks of message board postings to determine whether portfolio rules can be formed based on the network connections between stocks.
After news stories have been processed based on text, content, and context, investors and those involved in financial services want to understand how this large amount of information (even in processed form) relates to likely movements in a company's stock price. A commonly used term and form of measurement related to company risk is "Alpha." As used in this application, "Alpha" denotes a measure of performance on a risk-adjusted basis. For example, Alpha takes the volatility (i.e., price risk) of an instrument, such as a stock, bond, or mutual fund, and compares its risk-adjusted performance to another performance measure (e.g., a benchmark or other index). The return of an investment vehicle (e.g., a mutual fund) relative to the return of a benchmark (e.g., an index) is the investment vehicle's Alpha. Alpha can also refer to the abnormal rate of return on a security or portfolio in excess of what would be predicted by an equilibrium model such as the capital asset pricing model (CAPM). Alpha is one of five widely considered technical risk ratios. The other statistical measures of technical risk factors used in modern portfolio theory are beta, standard deviation, R-squared, and the Sharpe ratio. Investment firms use these statistical risk indicators to determine the risk-reward profile of stocks, bonds, or other instrument-based investment vehicles such as mutual funds. In the case of a mutual fund, for example, an Alpha of positive 1.0 means that the fund outperformed its benchmark index by 1%, and an Alpha of negative 1.0 means that it underperformed the benchmark by 1%. Accordingly, if a capital asset pricing model analysis estimates that a portfolio should earn 10% based on its risk, and the portfolio actually earns 15%, then the portfolio's Alpha is positive 5%, representing an excess return over what the model analysis predicted.
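The Alpha arithmetic described above can be illustrated with a minimal sketch. It assumes the standard CAPM form for the expected return; the function names, the risk-free rate, the beta, and the market return are hypothetical values chosen only so that the model predicts roughly 10%, matching the example in the text. They are not taken from the patent.

```python
def capm_expected_return(risk_free: float, beta: float, market_return: float) -> float:
    """Expected return under the standard CAPM: r_f + beta * (r_m - r_f)."""
    return risk_free + beta * (market_return - risk_free)


def alpha(actual_return: float, expected_return: float) -> float:
    """Alpha as the excess of the realized return over the model's prediction."""
    return actual_return - expected_return


# Hypothetical inputs: 2% risk-free rate, beta of 1.0, 10% market return,
# so the model predicts about 10% for the portfolio.
expected = capm_expected_return(risk_free=0.02, beta=1.0, market_return=0.10)

# The portfolio actually earned 15%, so Alpha is about +5 percentage points.
print(f"expected={expected:.2%}  alpha={alpha(0.15, expected):+.2%}")
```

A negative result from `alpha` would correspond to the underperformance case described above.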
In particular, as it relates to the present invention, mounting pressure from government regulators and an increasingly "green"-conscious public has led to growing demand from interested parties (e.g., the investment community and others in the financial services industry) for new tools to evaluate the degree of "greenness" (or green score or factor) and/or environmental compliance of companies/investments and to manage key areas of risk exposure. Investment firms and managers focused on green/environmental investing need a solution that provides information about a company's greenness and/or environmental compliance, together with tools for evaluating it. As used herein, "greenness" refers to a company's products, manufacturing, distribution, packaging, or other corporate practices as they relate to the environmental impact of the company and its products. For example, a product's green score may take into account the use of recycled materials in the product, the amount of energy required to operate the product, the electromagnetic effects of the product, and the amount of harmful emissions or pollution the product produces. Countries and regions have enacted legislation, regulations, certifications, standards, and other requirements relating to the operation of products and to the disposal, recycling, and treatment of such products (e.g., RoHS (EU)). Certain manufacturing processes and materials have been found to have harmful environmental effects and are restricted or regulated. Certain practices have been found to promote or satisfy environmental sustainability. In its operations, a company may go "paperless" and may include environmentally friendly materials and systems in its facilities. Allowing employees to work from home can reduce the burden of commuting, the consumption of natural resources, and harmful emissions.
Beyond investment considerations, businesses are increasingly aware of and focused on making green investments in conjunction with governance, risk, and compliance (GRC), corporate social responsibility (CSR), and environmental, social, and governance (ESG) initiatives. What is needed is a solution that helps such companies evaluate and track the effectiveness and performance of their green investments and efforts. What is needed is a tool that helps manage market and reputational risk arising from negative trends and demonstrates some level of conformity with green/social standards. In addition, regulators and other institutions need a solution that helps them identify and manage potential hot spots, such as topics or geographic areas of environmental concern, when debating, proposing, and enacting influential green legislation.
Green-related actions can have a serious impact on a variety of issues, directly and indirectly affecting businesses, market indices, and investors in equities, bonds, and the like. A recent example of a green-related event affecting valuation and behavior is the explosion of an offshore drilling platform off the Louisiana coast in the Gulf of Mexico and the resulting oil spill disaster. The event significantly affected the financial performance of several entities, including publicly traded British Petroleum ("BP"). News of the disaster had the immediate effect of causing BP's common stock to fall sharply on the day of the disaster and in the days that followed. In addition to asset losses, oil cleanup costs, and claims for compensation from those harmed by the spill, BP also suffered the resulting political and social fallout. The grounding of the Exxon Valdez oil tanker and the resulting spill is another such example. Although many organizations track such events and may maintain company scorecards representing relative performance, there is no system that efficiently monitors events and provides investors with contemporaneous information on how such events may affect corporate performance (e.g., stock prices).
The "green analytics" space is rich and growing rapidly, with investment firms and managers driving most of the growth in green analytics and having the highest projected demand. Existing products within the green analytics space generally fall into three categories: ESG risk solutions, thematic indexes and benchmarks, and reputation monitoring. One provider within the space is RiskMetrics/KLD, which specializes in web-based research services as well as thematic indexes and carbon analytics. Financial services firms offer ESG products through indexes and web-based research platforms. Societe Generale, for example, offers thematic indexes covering a variety of issues from human rights to CSR. Other players, such as FTSE, Dow Jones, and Calvert Investments, offer environmental indexes that investors can use for benchmarking and portfolio construction. Within the reputation monitoring space, companies such as RepRisk and Factiva Insight offer web-deployed tools that can be based on broad intelligence or focused on a specific area, for example brand risk as it relates to environmental issues. Third-party sources can be used so that analyst sentiment is processed visually and deployed via the web, allowing customers to monitor negative green news by company and industry.
All of these efforts have drawbacks, including an inherent redundancy that overshadows green-oriented products. These efforts to measure a company's greenness are flawed in that they all derive their metrics from the same sources (i.e., third-party research, corporate filings, regulations). Furthermore, the assessments are performed by analysts and depend heavily on the timeliness of public filings and secondary research, similar to the dilemma faced by credit rating agencies competing with real-time credit default swap curves.
Currently, customers face a market of products that offer essentially the same human-driven research tools, albeit with different deployment methods and visualizations. Asset managers serving green-minded retail and institutional investors may find it difficult to use these tools to fulfill their mandates to invest in green companies and, perhaps more importantly, to communicate the value of these investments to their customers. A recent study conducted by the University of Zurich highlights this dilemma. Using ESG data from RepRisk, the study compared the sustainability of green funds with that of conventional equity funds.
Because these tools are driven largely by the same sources and by fundamental analysis, they can produce similar results that do not fully capture the perceptions associated with being green. Arguably, these tools ignore underlying trends from non-traditional sources that would add significant value to decision making.
The same reasoning applies readily to businesses and regulators. Faced with the need to monitor their brands and manage reputational risk stemming from poor CSR performance and bad public relations, businesses need a tool that is updated regularly and draws on the large volume of new media in a systematic way. More importantly, they need a tool that captures the perceptual element missing from other products. At the same time, regulators are now tasked with managing environmental hot spots not only at the industry level but also at the company level, particularly when the companies in question receive public funds for investment.
What is needed is a system that can automatically process or "read" the news stories, filings, new/social media, and other content available to it and quickly interpret that content to arrive at a better understanding for assessing the environmental impact of an entity (private or public). There is also a need to create and apply predictive models that, based on an entity's environmental impact, anticipate the behavior of stock prices and other investment vehicles before they actually move. There is currently a need to use and leverage traditional and, in particular, new media sources and trends, and to meet customer demand for advanced analytics related to corporate performance, price behavior, investment, and reputational awareness, by providing a sentiment-based solution that extends the scope of conventional tools to include social media and online news.
Summary of the Invention
The present invention uses and leverages new media sources and trends to meet customer demand for advanced analytics related to ESG mandates, green investing, and reputational awareness. Social media has a growing influence on environmental issues. With the introduction of carbon legislation and the commercialization of a global culture oriented toward "greenness," the impact of new media on environmental and social governance will increase over time. In its embodiments, the present invention provides a green sentiment solution that extends the scope of conventional tools to include social media and online news in order to generate and present enhanced tools, content, and solutions. The present invention provides an indication of an entity's environmental behavior through a simple score, which can be negative or positive and which evolves over time. Intelligent analytics allow customers to measure the "greenness" of a business as perceived by conventional and new media. The solution aggregates private and public content from multiple sources, including social media content. A taxonomy is tuned to recognize topics, text, phrases, sentences, comments, and other content as having or lacking green or environmental significance. The results can take the form of one or more of the following: a green score, a composite environmental or green index, and a green company certification or classification.
In one implementation, the present invention provides a news/media analytics system (NMAS) and related methods adapted to automatically process and "read" news stories and content from blogs, Twitter, and other social media sources as close to real time as possible. The present invention employs quantitative analysis, techniques, or mathematics in combination with computer science to arrive at green scores and green certifications and/or to model the value of financial securities, including generating a composite environmental index. The present invention provides a system for automatically processing or "reading" news stories, filings, new/social media, and other content and for applying predictive models to that content to anticipate the behavior of stock prices and other investment vehicles. The NMAS leverages traditional and, in particular, new media sources to provide a sentiment-based solution that extends the scope of conventional tools to include social media and online news.
As an addition to, and in some respects a replacement for, traditional media sources and means of delivery, "social media" has added a new level of information sharing and gathering that goes well beyond conventional forms of media. Unconstrained by traditional models and workflows, blogs and other forms of social media have become highly accessible and wide-ranging sources of real-time news and updates. On the investment front, newer ventures such as Seeking Alpha, as well as traditional financial news providers, are moving into the blogosphere and social media at an exponential rate. Blogs and other new media have become a most important source of investment advice and, for some, have surpassed traditional sources. "Social media" or social network sources refer to non-traditional, often less formal forms of content delivery and include interactive data and content originating from users or crowds. Examples of social media include: news websites (reuters.com, bloomberg.com, etc.); online forums (livegreenforum.com); websites of government agencies (epa.gov); websites of academic institutions and political parties (mcgill.ca/mse, www.democrats.org, etc.); online magazine websites (emagazine.com/); blogging sites (Blogger, ExpressionEngine, LiveJournal, Open Diary, TypePad, Vox, WordPress, Xanga, etc.); microblogging sites (Twitter, FMyLife, Foursquare, Jaiku, Plurk, Posterous, Tumblr, Qaiku, Google Buzz, Identi.ca, Nasza-Klasa.pl, etc.); social and professional networking sites (Facebook, MySpace, ASmallWorld, Bebo, Cyworld, Diaspora, Hi5, Hyves, LinkedIn, Ning, Orkut, Plaxo, Tagged, XING, IRC, Yammer, etc.); online advocacy and fundraising sites (Greenpeace, Causes, Kickstarter); information aggregators (Netvibes, Twine, etc.); and Twitter.
In one manner, a private investor who is sensitive to the environmental behavior of an entity can use the present invention to monitor and collect information from social media that would otherwise be unavailable, or at least delayed, when monitoring only traditional "mainstream" or conventional media. With the increasingly widespread adoption of new social media, such sources are themselves becoming "mainstream." In addition, the present invention can be used to aggregate content from several social media content producers to corroborate, validate, or otherwise strengthen the information collected.
The NMAS can include sentiment processing to process news/media information and assign a "sentiment score" to news/media items related to one or more companies. The score can be derived from the text and metadata of the news/media item, and predefined or learned lexicon-based and/or sentiment patterns can be applied to the processed text/metadata. The NMAS can include a training or learning module that analyzes past news/media about certain events, and the resulting response of the related stock prices, to build models for predicting stock behavior given certain types of news or events, including news or events related to green or environmental events, credentials, legislation, and the like.
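As a rough illustration of lexicon-based sentiment scoring of the kind described above, the following minimal sketch assigns each news/media item a score between -1 and +1. The two word lists, the tokenization, and the normalization are assumptions made for this example only; the patent does not specify a lexicon or a formula, and a practical system would use a much richer, possibly learned, lexicon.

```python
import re

# Hypothetical, deliberately tiny lexicons (not from the patent).
POSITIVE = {"recycled", "renewable", "sustainable", "clean", "efficient"}
NEGATIVE = {"spill", "pollution", "toxic", "emissions", "violation"}


def sentiment_score(text: str) -> float:
    """Return (positive hits - negative hits) / total hits, or 0.0 if no hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


for item in (
    "Oil spill causes toxic pollution",
    "The plant uses renewable energy and recycled materials",
    "Quarterly earnings were flat",
):
    print(f"{sentiment_score(item):+.2f}  {item}")
```

Normalizing by the number of lexicon hits keeps the score comparable across items of different lengths; counting only exact word matches is a simplification that ignores negation and context.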
In one manner, the present invention can be used to process traditional and new media content sources as sources for determining or representing "Alpha" in the context of "greenness" or of a composite environmental index. In an exemplary implementation, an NMAS operated by a traditional financial services company can apply internal text sources and external sources to predictive models to arrive at anticipated market-related behavior. Hard facts and sentiment are treated as factors driving the green score and/or the composite environmental index. NMAS news/media sentiment analysis and green scoring enhance investment and trading strategies and lead to informed trading and investment decisions.
此外,本发明可以被用来生成具有环境意识或者环境友好的公司的分类系统,其充当用于绿色投资的分类系统。本发明可以被用来将一家公司分类或认证为“绿色合规”,以及用来创建由已经获取绿色认证的公司所构成的“绿色情绪索引”。绿色索引有可能吸引投资者对促进环境负责任的业务感兴趣。Furthermore, the present invention can be used to generate a classification system for environmentally conscious or environmentally friendly companies, serving as a classification system for green investing. The present invention can be used to classify or certify a company as "green compliant," as well as to create a "green sentiment index" comprised of companies that have received green certifications. Green indices have the potential to attract investors interested in promoting environmentally responsible businesses.
不像依赖于由分析师处理的周期性研究的其他方法,本发明持续处理媒体馈送并且产生信息和数据流,所述信息和数据流捕获日常趋势以及允许用户(例如顾客)访问一系列内容的门户和智能警报的附加价值。随着绿色或环境相关的新闻和社交媒体内容增多,媒体服务公司将利用例如Thomson Reuters Markets之类的跨广阔供应品平台的产品和服务。本发明使得公司能够把跨分区的供应品相联系,并且加速绿色分析法空间的市场占有率渗透。Unlike other approaches that rely on periodic research conducted by analysts, the present invention continuously processes media feeds and generates information and data streams that capture daily trends, along with the added value of portals and intelligent alerts that allow users (e.g., customers) access to a range of content. As green or environmental news and social media content grows, media services companies will leverage products and services across a broad offering platform, such as Thomson Reuters Markets. The present invention enables companies to connect offerings across sectors and accelerate market penetration in the green analytics space.
本发明可以被用来随着时间跟踪“绿色”情绪,以提供对于公司相关的新闻/媒体评论的分析,以及用以基于绿色或环境问题引导交易和投资决策的工具和分析法。本发明可以由利用语言学技术的自然语言处理来激励。本发明提供支持人类决策制定、风险管理和资产分配的定量“绿色”策略。本发明可以被用于做市(market making)中、用于资产组合管理中以通过对资产组合情绪定基准以及计算业界加权来改善资产分配决策、用于预报股票、业界和市场前景的基础分析、用于风险管理以更好地理解针对资产组合的异常风险并且以发展潜在的情绪防护,并且以跟踪并对公众感知和媒体覆盖定基准以及对于竞争者也这样做。The present invention can be used to track "green" sentiment over time to provide analysis of company-related news/media commentary, as well as tools and analytics to guide trading and investment decisions based on green or environmental issues. The present invention can be powered by natural language processing using linguistic techniques. The present invention provides quantitative "green" strategies that support human decision-making, risk management, and asset allocation. The present invention can be used in market making, in portfolio management to improve asset allocation decisions by benchmarking portfolio sentiment and calculating industry weightings, for fundamental analysis to forecast stock, industry, and market outlooks, in risk management to better understand unusual risks to the portfolio and to develop potential sentiment protections, and to track and benchmark public perception and media coverage, as well as competitor performance.
在第一实施例中,本发明提供一种。一种计算机实现的方法,包括:标识将向其指派绿色分数的实体;基于社交媒体信息集来计算绿色分数;以及传送该绿色分数。社交媒体信息集可以与所标识的实体相关联,所述实体与交易证券相关联,所述社交媒体信息集表示关系到所述实体的绿色信息,并且所述绿色分数表示所述实体的绿色性属性。所述方法还可以包括:基于附加的社交媒体信息集随时间修改所述绿色分数,以及传送经修改的绿色分数。计算绿色分数可以包括至少部分地基于所述社交媒体信息集来确定情绪分数,并且其中所述绿色分数表示绿色情绪分数。所述方法还可以包括:至少部分地基于绿色分数来将所述实体认证为绿色合规的;针对第二所标识的实体执行步骤(a)-(c)导致第二绿色分数,并且其中传送绿色分数包括传送包括反映所述绿色分数和第二绿色分数的数据的数据馈送;至少部分地基于所计算的绿色分数的值来生成关系到所述实体的警报信号;和/或在从社交媒体集接收的内容集中标识文本,所述文本被标识为表示所述实体,并且还包括从所述内容集提取被标识为表示与所述实体的绿色性相关的情绪的另外的文本。内容集中的所标识的文本可以包括以下各项中的一个或多个:标识所嵌入的元数据或其他描述符;处理文本、词语、短语;应用自然语言语言学分析;应用贝叶斯技术。所述方法还可以包括聚合来自多个源的内容集,所述多个源包括至少一个社交媒体源和来自包括以下项的组中的至少一个附加的源:新闻网站(reuters.com、bloomberg.com等等);在线论坛(livegreenforum.com);政府机构的网站(epa.gov);学术机构、政党的网站(mcgill.ca/mse、www.democrats.org);在线杂志网站(emagazine.com);博客网站(Blogger、ExpressionEngine、LiveJournal、Open Diary、TypePad、Vox、WordPress、Xanga);微博网站(Twitter、FMyLife、Foursquare、Jaiku、Plurk、Posterous、Tumblr、Qaiku、Google Buzz、Identi.ca、Nasza-Klasa.pl);社交和专业人士联网站点(facebook、myspace、ASmallWorld、Bebo、Cyworld、Diaspora、Hi5、Hyves、LinkedIn、MySpace、Ning、Orkut、Plaxo、Tagged、XING、IRC、Yammer);在线宣传和筹款网站(Greenpeace、Causes、Kickstarter);信息聚合商(Netvibes、Twine等等);Facebook;以及Twitter,所述内容集包括所述社交媒体信息集。所述方法还可以包括:标识要与复合索引相关联的公司集,所述公司集包括所述实体并且与证券集相关联;至少部分地基于所述社交媒体信息集来生成针对所述证券集的复合环境索引;以及传送与所述复合环境索引相关联的信号。所述复合环境索引至少部分地基于与所述公司集相关联的绿色分数集来确定,并且所述绿色分数可以被实时地生成和/或被得到是基于以下正面准则中的一个或多个:产品或制造环境相关的合规性或认证;能量效率;促进环境管理工作、消费者保护、人权和多样性的企业实践,在绿色技术、能量高效技术、替代燃料技术、可再生资源技术中所涉及的业务/产品,和/或基于以下负面准则中的一个或多个:在酒、烟草、赌博、武器和/或军事方面所涉及的业务,以及环境标准不合规的业务。所述实体与市场上交易的证券相关联,并且还包括应用预测性模型以得到与所述证券相关联的预测行为。所述方法还可以包括:生成预测行为的表达和/或要根据所述预测行为而采取的建议动作。所述建议动作涉及关系到所述证券的交易决策,并且是由买入、卖出或持有构成的组中的一项。所述社交媒体信息集基于时间价值来标识。所述方法还可以包括:生成表示潜在风险的风险信号;在计算设备上提供风险指示模式集;以及通过使用至少部分地基于所述风险指示模式集的风险标识算法来在所述社交媒体信息集内标识与所述实体相关联的潜在风险集;把所述潜在风险集与所述风险指示模式进行比较,以获得先决条件风险集;生成表示所述先决条件风险集的信号;以及把表示所述先决条件风险集的信号存储在电子存储器中。所述方法还可以包括:针对实体集重复步骤(a)-(c)导致针对所述实体集中的每个实体所计算的绿色分数;标识绿色分类;以及至少部分地基于相应的实体绿色分数针对包括在所述分类中从所述实体集选择实体。所述分类可以涉及将公司认证为绿色合规的,并且其中所选择的实体中的每个被认证为绿色合规的。In a first embodiment, the present invention provides a 
computer-implemented method comprising: identifying an entity to be assigned a green score; calculating a green score based on a set of social media information; and transmitting the green score. The set of social media information may be associated with the identified entity, the entity being associated with a traded security, the set of social media information representing green information related to the entity, and the green score representing a greenness attribute of the entity. The method may also include modifying the green score over time based on additional sets of social media information, and transmitting the modified green score. Calculating the green score may include determining a sentiment score based at least in part on the set of social media information, in which case the green score represents a green sentiment score. The method may also include: certifying the entity as green compliant based at least in part on the green score; performing steps (a)-(c) for a second identified entity to produce a second green score, wherein transmitting the green score includes transmitting a data feed including data reflecting the green score and the second green score; generating an alert signal related to the entity based at least in part on the value of the calculated green score; and/or identifying, in a content set received from a set of social media sources, text that is identified as representing the entity, and extracting from the content set additional text that is identified as representing sentiment related to the greenness of the entity. Identifying text in the content set may include one or more of: identifying embedded metadata or other descriptors; processing text, words, and phrases; applying natural language linguistic analysis; and applying Bayesian techniques. 
The method may further include aggregating content sets from a plurality of sources, the plurality of sources including at least one social media source and at least one additional source from the group consisting of: news websites (reuters.com, bloomberg.com, etc.); online forums (livegreenforum.com); websites of government agencies (epa.gov); websites of academic institutions and political parties (mcgill.ca/mse, www.democrats.org); online magazine websites (emagazine.com); blog websites (Blogger, ExpressionEngine, LiveJournal, Open Diary, TypePad, Vox, WordPress, Xanga); microblogging websites (Twitter, FMyLife, Foursquare, Jaiku, Plurk, Posterous, Tumblr, Qaiku, Google Buzz, Identi.ca, Nasza-Klasa.pl); social and professional networking sites (facebook, myspace, ASmallWorld, Bebo, Cyworld, Diaspora, Hi5, Hyves, LinkedIn, MySpace, Ning, Orkut, Plaxo, Tagged, XING, IRC, Yammer); online advocacy and fundraising sites (Greenpeace, Causes, Kickstarter); information aggregators (Netvibes, Twine, etc.); Facebook; and Twitter, the content sets including the set of social media information. The method may also include: identifying a set of companies to be associated with a composite index, the set of companies including the entity and being associated with a set of securities; generating a composite environmental index for the set of securities based at least in part on the set of social media information; and transmitting a signal associated with the composite environmental index. 
The composite environmental index is determined at least in part based on a set of green scores associated with the set of companies, and the green scores can be generated in real time and/or derived based on one or more of the following positive criteria: compliance or certification related to product or manufacturing environments; energy efficiency; corporate practices that promote environmental stewardship, consumer protection, human rights, and diversity; and businesses/products involved in green technology, energy-efficient technology, alternative fuel technology, and renewable resource technology; and/or based on one or more of the following negative criteria: businesses involved in alcohol, tobacco, gambling, weapons, and/or the military; and businesses that do not comply with environmental standards. Where the entity is associated with a security traded on a market, the method may further include applying a predictive model to derive predicted behavior associated with the security. The method may also include generating an expression of the predicted behavior and/or a recommended action to be taken based on the predicted behavior. The recommended action relates to a trading decision concerning the security and is one of the group consisting of buy, sell, and hold. The set of social media information is identified based on time value. The method may also include: generating a risk signal representing a potential risk; providing a risk indicator pattern set on a computing device; identifying a potential risk set associated with the entity within the set of social media information using a risk identification algorithm based at least in part on the risk indicator pattern set; comparing the potential risk set to the risk indicator patterns to obtain a prerequisite risk set; generating a signal representing the prerequisite risk set; and storing the signal representing the prerequisite risk set in an electronic memory. 
The method may also include: repeating steps (a)-(c) for a set of entities to produce a green score for each entity in the set; identifying a green classification; and selecting entities from the set for inclusion in the classification based at least in part on their respective green scores. The classification may involve certifying companies as green compliant, in which case each of the selected entities is certified as green compliant.
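The composite environmental index described above is determined at least in part from a set of per-company green scores, but the text does not fix a formula. The following is a minimal sketch under assumed conventions: the function name, the company identifiers, and the optional weighting scheme are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def composite_environmental_index(green_scores, weights=None):
    """Combine per-company green scores into a single index value.

    green_scores: dict mapping a company identifier to its green score.
    weights: optional dict of non-negative weights per company
             (a hypothetical choice, e.g. market capitalisation).
    """
    if not green_scores:
        raise ValueError("at least one green score is required")
    if weights is None:
        # Equal weighting: plain arithmetic mean of the scores.
        return fmean(green_scores.values())
    total = sum(weights[c] for c in green_scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of the scores.
    return sum(green_scores[c] * weights[c] for c in green_scores) / total


# Hypothetical companies and scores, for illustration only.
scores = {"AAA": 0.5, "BBB": -0.25, "CCC": 0.75}
index_equal = composite_environmental_index(scores)
index_weighted = composite_environmental_index(
    scores, {"AAA": 2.0, "BBB": 1.0, "CCC": 1.0}
)
```

A weighted mean is only one possible aggregation; a median or a capped sum would fit the description equally well.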
在第二实施例中,本发明提供一种基于计算机的系统,包括:被适配成执行代码的处理器;用于存储可执行代码的存储器;被适配成接收与将向其指派绿色分数的实体相关联的社交媒体信息集的输入;被适配成由处理器执行的绿色分数模块,并且所述绿色分数模块包括可由处理器执行以基于社交媒体信息集来计算绿色分数的代码;以及被适配成传送与所述绿色分数相关联的信号的输出。In a second embodiment, the present invention provides a computer-based system comprising: a processor adapted to execute code; a memory for storing the executable code; an input adapted to receive a set of social media information associated with an entity to which a green score is to be assigned; a green score module adapted to be executed by the processor, the green score module including code executable by the processor to calculate a green score based on the set of social media information; and an output adapted to transmit a signal associated with the green score.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
为了促进全面理解本发明,现在参照附图,其中利用相似的附图标记指代相似的元件。这些附图不应当被解释为限制本发明,而是意图为示例性的并且用于参照。In order to facilitate a full understanding of the present invention, reference is now made to the accompanying drawings, in which like elements are referred to using like reference numerals. These drawings should not be construed as limiting the present invention, but are intended to be illustrative and for reference purposes only.
图1是图示了用于实现本发明的示例性的基于计算机的系统的第一示意图;FIG. 1 is a first schematic diagram illustrating an exemplary computer-based system for implementing the present invention;
图2是图示了用于实现本发明的示例性的基于计算机的系统的第二示意图;FIG. 2 is a second schematic diagram illustrating an exemplary computer-based system for implementing the present invention;
图3是图示了实现本发明的示例性方法的搜索流程图;FIG. 3 is a search flow chart illustrating an exemplary method for implementing the present invention;
图4是图示了使用预测性建模作为采用本发明的系统的输入和输出的数据库和文档处理、情绪和绿色评分的流程图;FIG. 4 is a flow chart illustrating database and document processing, sentiment, and green scoring using predictive modeling as input and output of a system employing the present invention;
图5是表示结合本发明的用于产生情绪以供在绿色评分中使用的示例性方法的流程图;FIG. 5 is a flow chart illustrating an exemplary method for generating sentiment for use in green scoring in conjunction with the present invention;
图6是表示结合本发明的采用网站形式的绿色群体的表达的图表;FIG. 6 is a diagram showing a representation of a green group in the form of a website in conjunction with the present invention;
图7表示结合本发明的输出或服务的示例性形式;以及FIG. 7 shows exemplary forms of output or services in conjunction with the present invention; and
图8-16是用于在实现本发明中使用的风险挖掘技术的示例。FIGS. 8-16 are examples of risk mining techniques for use in implementing the present invention.
具体实施方式DETAILED DESCRIPTION
现在将参照如附图中所示出的示例性实施例更详细地描述本发明。虽然在本文参照示例性实施例描述了本发明,但是应当理解的是,本发明不限于此类示例性实施例。可以使用本文的教导的本领域技术人员将认识到附加实现、修改和实施例以及用于使用本发明的其他应用,其完全在本文所公开并要求保护的本发明的范围内被考虑,并且关于其本发明可以具有重要效用。The present invention will now be described in more detail with reference to the exemplary embodiments as shown in the accompanying drawings. Although the present invention is described herein with reference to exemplary embodiments, it should be understood that the present invention is not limited to such exemplary embodiments. Those skilled in the art who can use the teachings herein will recognize additional implementations, modifications and embodiments and other applications for using the present invention, which are fully contemplated within the scope of the invention disclosed and claimed herein, and with respect to which the present invention may have significant utility.
本发明使用和利用新媒体资源和趋势来满足顾客对于与CSR、ESG委托、绿色投资和声誉觉知相关的先进分析法的需要。本发明在其各个实施例中提供了一种绿色情绪解决方案,其把常规工具的范围扩展为包括社交媒体和在线新闻,以生成并呈现增强的工具、内容和解决方案。本发明包括对常规和新媒体进行分析以测量公司的“绿色性”以及表示实体的环境行为的作为结果的分数的智能分析法。所述绿色性分数可以是简单的分数,其可以是负的或正的并且可以随着时间演进。本发明聚合来自多个源、包括社交媒体或网络内容、新闻、网站以及机构新闻专线(例如Twitter、Facebook、网站、RSS)的私有和公共的内容。分类法被调谐成将主题、文本、短语、语句、评论和其他内容理解为具有或不具有绿色或环境含义。The present invention uses and leverages new media resources and trends to meet customer needs for advanced analytics related to CSR, ESG mandates, green investing, and reputation awareness. The present invention, in its various embodiments, provides a green sentiment solution that expands the scope of conventional tools to include social media and online news to generate and present enhanced tools, content, and solutions. The present invention includes intelligent analytics that analyze conventional and new media to measure the "greenness" of a company and the resulting score that represents the entity's environmental behavior. The greenness score can be a simple score that can be negative or positive and can evolve over time. The present invention aggregates private and public content from multiple sources, including social media or web content, news, websites, and institutional newswires (e.g., Twitter, Facebook, websites, RSS). The taxonomy is tuned to understand topics, text, phrases, sentences, comments, and other content as having or not having green or environmental meaning.
本发明可以包括情绪、感觉和情感计算技术,用以对文本进行分析以辨识关系到影响公司业绩的绿色问题的人类情绪,并且预期进一步的人类响应,例如卖出或买入与公司相关的证书。人类感情可以被视为时间导出函数,其具有一系列相关的因果或者“影响和效果”。例如,在一种给定情况下,例如面对潜在致命冲突的人,可以预期在恐惧的人类感情之后是一个或多个替代的人类响应,例如逃跑或防卫。可以使用概率数值或关系来表示针对所述情况的一个或多个预期未来反应。常常使用贝叶斯网络来表示因果关系。可以使用附加的数据来进一步精炼或者定义所述一个或多个概率关系。例如,如果受到威胁的人拥有武器,则可以向上调节自卫的概率并且向下调节逃跑的概率。同样地,如果此人被逼入角落或者以其他方式在逃离手段方面受到限制,则可以调节所述概率。本发明使用所检测到的人类感情来预期进一步的人类反应,并且是在集体基础上这样做。所述系统然后可以预测或预期针对该预期感情的人类响应,例如通常卖出股票或者卖出作为负面发布的对象的特定股票。本发明收集或使用或观察关系到作为在博客、维基、在线论坛、聊天室、留言板和社交媒体网络处表达的对象的人类感情,以检测关系到绿色问题的“情绪”,例如公司关于使用“绿色”或环境友好的原料或材料或实践的声明。本发明使用本文所讨论的技术对所收集的信息进行处理,以基于所确定的情绪导出绿色分数或评级。所述分数然后还可以被用来推荐公司或警报或者以其他方式标识公司以供投资考虑。本发明还可以被用来生成满足选择准则的公司的复合索引,此类准则与有环境意识或环境敏感的实践相关。采用该方式,投资者、个人、基金等等可以使用此类分数、评级或索引来作为投资决策的基础。The present invention may include emotion, feeling, and affective computing techniques to analyze text to identify human emotions related to green issues that affect a company's performance and to anticipate further human responses, such as selling or buying company-related securities. Human emotions can be viewed as time-derived functions with a series of related causes and effects, or "influences and effects." For example, in a given situation, such as a person facing a potentially deadly conflict, the human emotion of fear can be expected to be followed by one or more alternative human responses, such as flight or defense. One or more expected future reactions to the situation can be represented using probabilistic values or relationships. Bayesian networks are often used to represent causal relationships. Additional data can be used to further refine or define the one or more probabilistic relationships. For example, if the person being threatened is armed, the probability of self-defense can be adjusted upward and the probability of flight can be adjusted downward. Similarly, if the person is cornered or otherwise limited in their means of escape, the probabilities can be adjusted. 
The present invention uses the detected human emotions to anticipate further human reactions, and does so on a collective basis. The system can then predict or anticipate human responses to the anticipated emotions, such as selling stocks generally or selling a specific stock that is the subject of a negative publication. The present invention collects, uses, or observes human emotions expressed on blogs, wikis, online forums, chat rooms, message boards, and social media networks to detect "sentiment" related to green issues, such as statements by companies about using "green" or environmentally friendly ingredients, materials, or practices. The present invention processes the collected information using the techniques discussed herein to derive a green score or rating based on the determined sentiment. The score can then also be used to recommend companies, trigger alerts, or otherwise identify companies for investment consideration. The present invention can also be used to generate a composite index of companies that meet selection criteria, such criteria being related to environmentally conscious or environmentally sensitive practices. In this way, investors, individuals, funds, etc. can use such scores, ratings, or indexes as the basis for investment decisions.
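The adjustment of response probabilities in light of additional data (for example, raising the probability of self-defense when the threatened person is armed) can be illustrated with a single application of Bayes' rule. This is a minimal sketch, not the patent's model: the function name and all numeric values are assumptions made up for the example.

```python
def update_response_probabilities(prior, likelihood):
    """Apply Bayes' rule: posterior(r) is proportional to prior(r) * P(evidence | r).

    prior: dict mapping each candidate response to its prior probability.
    likelihood: dict mapping each response to P(evidence | response).
    """
    unnormalized = {r: prior[r] * likelihood[r] for r in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every response")
    return {r: v / total for r, v in unnormalized.items()}


# Hypothetical numbers: before any extra data, flight is judged more likely.
prior = {"flee": 0.6, "defend": 0.4}
# Assumed likelihood of observing "the person is armed" given each response.
armed_likelihood = {"flee": 0.2, "defend": 0.7}

posterior = update_response_probabilities(prior, armed_likelihood)
```

With these assumed inputs the probability of defense rises and the probability of flight falls, which matches the direction of adjustment described in the text. A full Bayesian network would chain several such conditional relationships.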
采用一种实现,参照图1,本发明提供一种新闻/媒体分析系统(NMAS)100,其被适配成尽可能接近实时地自动处理和“阅读”来自由新闻/媒体全集110所表示的博客、twitter和其他社交媒体源的新闻报道和内容。结合计算机科学的定量分析、技术或数学(诸如绿色评分/复合模块124和情绪处理模块125)由服务器120的处理器121处理,以得到绿色分数、绿色认证和/或对财经证券的价值进行建模,包括生成复合环境或绿色索引。NMAS 100自动处理新闻报道、申报、新/社交媒体和其他内容,并且针对所述内容应用一个或多个模型,以确定绿色评分和/或股票价格和其他投资媒介物的预期行为。NMAS 100利用传统的并且特别是新媒体资源来提供一种将常规工具的范围扩展为包括社交媒体和在线新闻的基于情绪的解决方案。In one implementation, referring to FIG. 1, the present invention provides a News/Media Analysis System (NMAS) 100 adapted to automatically process and "read" news stories and content from blogs, Twitter, and other social media sources represented by a News/Media Corpus 110 in as close to real time as possible. Quantitative analysis, techniques, or mathematics combined with computer science, such as a green scoring/composite module 124 and a sentiment processing module 125, are executed by a processor 121 of a server 120 to derive green scores and green certifications and/or to model the value of financial securities, including generating composite environmental or green indexes. NMAS 100 automatically processes news stories, filings, new/social media, and other content and applies one or more models to the content to determine green scores and/or the expected behavior of stock prices and other investment vehicles. NMAS 100 leverages traditional and, in particular, new media sources to provide a sentiment-based solution that extends the scope of conventional tools to include social media and online news.
NMAS 100可以经由新闻/媒体全集110中的新媒体源1141、博客1142和社交媒体1143将来自以下示例性新的且社交媒体源的内容接收为输入:新闻网站(reuters.com、bloomberg.com等等);在线论坛(livegreenforum.com);政府机构的网站(epa.gov);学术机构、政党的网站(mcgill.ca/mse、www.democrats.org等等);在线杂志网站(emagazine.com/);博客网站(Blogger、ExpressionEngine、LiveJournal、Open Diary、TypePad、Vox、WordPress、Xanga等等);微博网站(Twitter、FMyLife、Foursquare、Jaiku、Plurk、Posterous、Tumblr、Qaiku、Google Buzz、Identi.ca Nasza-Klasa.pl等等);社交和专业人士联网站点(facebook、myspace、ASmallWorld、Bebo、Cyworld、Diaspora、Hi5、Hyves、LinkedIn、MySpace、Ning、Orkut、Plaxo、Tagged、XING、IRC、Yammer等等);在线宣传和筹款网站(Greenpeace、Causes、Kickstarter);信息聚合商(Netvibes、Twine等等);Facebook;以及Twitter。NMAS 100 can receive content from the following exemplary new and social media sources as input via new media sources 1141, blogs 1142, and social media 1143 in the news/media corpus 110: news websites (reuters.com, bloomberg.com, etc.); online forums (livegreenforum.com); government agency websites (epa.gov); websites of academic institutions and political parties (mcgill.ca/mse, www.democrats.org, etc.); online magazine websites (emagazine.com/); blog websites (Blogger, ExpressionEngine, LiveJournal, Open Diary, TypePad, Vox, WordPress, Xanga, etc.); microblogging websites (Twitter, FMyLife, Foursquare, Jaiku, Plurk, Posterous, Tumblr, Qaiku, Google Buzz, Identi.ca, Nasza-Klasa.pl, etc.); social and professional networking sites (Facebook, MySpace, ASmallWorld, Bebo, Cyworld, Diaspora, Hi5, Hyves, LinkedIn, MySpace, Ning, Orkut, Plaxo, Tagged, XING, IRC, Yammer, etc.); online advocacy and fundraising sites (Greenpeace, Causes, Kickstarter); information aggregators (Netvibes, Twine, etc.); Facebook; and Twitter.
图1的NMAS 100包括情绪处理模块125,其被适配成处理经由新闻/媒体全集110接收为输入的新闻/媒体信息,并且向与一个或多个公司相关的新闻/媒体项目指派“情绪分数”。情绪和情绪分数可以从计算语言学导出,并且例如通常利用相应的+1、-1和0的分数将文章、博客、社交媒体评论等等的基调定义或表示为正、负或中性。所述分数可以从来自新闻/媒体的文本和/或(现有的或者由引擎新指派的)元数据导出,并且可以对经处理的文本/元数据应用预定义的或者所学习的基于词典的和/或情绪模式。NMAS 100可以包括训练或学习模块127,其根据某些“事实”或事件对过去的或归档的新闻/媒体以及作为结果的相关股票价格的响应进行分析,来构建用以在给定某些类型的新闻或事件的情况下预测股票行为的模型,包括与绿色或环境事件、凭证、立法等等相关的新闻或事件。The NMAS 100 of FIG. 1 includes a sentiment processing module 125 adapted to process news/media information received as input via the news/media corpus 110 and assign a "sentiment score" to news/media items related to one or more companies. Sentiment and sentiment scores can be derived from computational linguistics and, for example, typically define or represent the tone of an article, blog post, social media comment, etc. as positive, negative, or neutral, using scores of +1, -1, and 0, respectively. The scores can be derived from text and/or metadata (either existing or newly assigned by the engine) from the news/media, and predefined or learned lexicon-based and/or sentiment models can be applied to the processed text/metadata. The NMAS 100 can also include a training or learning module 127 that analyzes past or archived news/media and the resulting associated stock price responses based on certain "facts" or events to build models for predicting stock behavior given certain types of news or events, including news or events related to green or environmental events, certifications, legislation, and the like.
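As an illustration of the +1 / -1 / 0 scoring convention and the lexicon-based approach mentioned above, the sketch below scores a piece of text by counting matches against two small word lists. The word lists and the function name are hypothetical; a production lexicon would be far larger and could be learned rather than predefined.

```python
import re

# Tiny illustrative lexicons; these word lists are assumptions, not from the source.
POSITIVE = {"renewable", "certified", "efficient", "sustainable"}
NEGATIVE = {"spill", "violation", "pollution", "fined"}


def sentiment_score(text):
    """Return +1, -1, or 0 for positive, negative, or neutral tone."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Each positive hit adds one, each negative hit subtracts one.
    balance = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    # Collapse the balance to its sign.
    return (balance > 0) - (balance < 0)


positive_example = sentiment_score("Acme plant certified as energy efficient")
negative_example = sentiment_score("Regulator fined Acme after an oil spill")
neutral_example = sentiment_score("Quarterly results announced")
```

Simple word counting ignores negation and context ("not certified" would still count as positive), which is one reason the text also refers to learned patterns and linguistic analysis.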
采用一种方式,NMAS 100可以被用来将传统和新媒体内容源110处理为确定或表示“绿色性”或复合环境索引的上下文中的“Alpha”的源。在示例性的实现中,NMAS 100由传统的财经服务公司(例如Thomson Reuters)运营,其中主要数据库——内部112是内部文本源(例如TR News和TR Feeds),并且NMAS 100针对绿色评分模块124和情绪处理模块125应用数据并且可以包括用以得到预期的市场相关的行为的预测性模型。例如,作为内部主要数据库的Thomson Reuters源可以包括法律源(Westlaw)、规章(特别是SEC、争议数据、业界特定等等)、社交媒体(应用特殊的元数据以使其有用)以及新闻(Thomson Reuters News)和类新闻源,包括财经新闻和报告。此外还可以使用自由可用的或者基于预订的外部源114来补充内部源112,作为由所述预测性模型考虑的附加数据点。硬事实(例如油井爆炸导致直接财经损失(收入损失、损害赔偿等等)以及负面环境影响和作为结果的负绿色性分数)和情绪(例如定量恐惧、不确定性、负面声誉等等的效应)被视为驱动绿色评分和/或复合环境或绿色索引的因素。结果可以被用来增强投资和交易策略(例如股票和其他股权、债券和商品),并且使得用户能够跟踪和发现新的机会以及生成Alpha。新闻/媒体情绪分析125可以结合绿色评分模块124用来提供绿色评分,以驱动明智的交易和投资决策。In one approach, NMAS 100 can be used to process traditional and new media content sources 110 as sources for determining or representing "greenness" or "alpha" in the context of a composite environmental index. In an exemplary implementation, NMAS 100 is operated by a traditional financial services company (e.g., Thomson Reuters), where the primary database, internal 112, is an internal text source (e.g., TR News and TR Feeds). NMAS 100 applies this data to a green scoring module 124 and a sentiment processing module 125 and may include predictive models for deriving expected market-related behavior. For example, Thomson Reuters sources, as the primary internal database, may include legal sources (Westlaw), regulations (particularly SEC, dispute data, industry-specific, etc.), social media (with special metadata applied to make them useful), and news (Thomson Reuters News) and news-like sources, including financial news and reports. Furthermore, external sources 114, either freely available or subscription-based, may be used to supplement internal sources 112 as additional data points considered by the predictive models. Both hard facts (e.g., an oil well explosion resulting in direct financial losses (lost revenue, damages, etc.) 
and negative environmental impacts, resulting in a negative greenness score) and sentiment (e.g., the effects of quantitative fear, uncertainty, negative reputation, etc.) are considered factors driving green scores and/or composite environmental or green indexes. The results can be used to enhance investment and trading strategies (e.g., stocks and other equities, bonds, and commodities), enabling users to track and discover new opportunities and generate alpha. News/media sentiment analysis 125 can be combined with the green scoring module 124 to provide green scores to drive informed trading and investment decisions.
此外,NMAS 100可以包括绿色分类模块128,其被适配成生成有环境意识或者环境友好的公司的分类系统,其充当用于绿色投资的分类系统并且可以被用来创建复合环境索引。例如,当前被指派RIC(Reuters证书代码,其是被用来标识财经证书和索引的类标贴(ticker)代码)的公司可以被分类为“绿色合规”(例如被归档/保持具有某一级别和/或持续时间的绿色分数)。采用该方式,本发明可以出于交易目的被用来创建绿色RIC分类。例如,可以生成和保持例如由已经获取绿色认证或绿色RIC等等的公司所构成的“绿色情绪索引”。绿色索引有可能吸引投资者对促进环境负责任的业务感兴趣。Furthermore, the NMAS 100 may include a green classification module 128, which is adapted to generate a classification system for environmentally conscious or environmentally friendly companies, serving as a classification system for green investing that can also be used to create a composite environmental index. For example, a company currently assigned a RIC (Reuters Instrument Code, a ticker-like code used to identify financial instruments and indices) may be classified as "green compliant" (e.g., on record as having maintained a green score of a certain level and/or for a certain duration). In this manner, the present invention can be used to create a green RIC classification for trading purposes. For example, a "green sentiment index" comprised of companies that have obtained green certifications, green RICs, or the like may be generated and maintained. A green index has the potential to attract investors interested in promoting environmentally responsible businesses.
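The "certain level and/or duration" test for green compliance can be sketched as a threshold check over a company's recent score history. The thresholds, function names, and identifiers below are hypothetical placeholders; the text does not specify actual values.

```python
def is_green_compliant(score_history, min_score=0.5, min_periods=3):
    """True if the most recent `min_periods` scores are all >= `min_score`.

    Both thresholds are illustrative assumptions.
    """
    if len(score_history) < min_periods:
        return False
    return all(s >= min_score for s in score_history[-min_periods:])


def green_classification(histories, **thresholds):
    """Return the sorted identifiers of companies that pass the check."""
    return sorted(
        company
        for company, history in histories.items()
        if is_green_compliant(history, **thresholds)
    )


# Hypothetical identifiers and score histories (oldest score first).
histories = {
    "AAA.N": [0.2, 0.6, 0.7, 0.8],  # last three periods above the level
    "BBB.N": [0.9, 0.9, 0.4],       # most recent period below the level
    "CCC.N": [0.9, 0.9],            # not enough history
}
compliant = green_classification(histories)
```

The resulting list could serve as the membership of a green index; requiring a sustained run of scores, rather than a single reading, keeps a one-off positive story from triggering certification.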
在一个实施例中,NMAS 100可以包括训练或机器学习模块127(诸如ThomsonReuters的Machine Learning Capabilities and News Analytics(机器学习能力和新闻分析法)),以从环境数据、新闻和社交媒体的广阔全集导出洞察,从而以公司(例如IBM)和索引级别(例如S&P 500)提供规范化的绿色分数。该历史数据库或全集可以与新闻/媒体全集110相分离或者从其导出。In one embodiment, the NMAS 100 may include a training or machine learning module 127 (such as Thomson Reuters' Machine Learning Capabilities and News Analytics) to derive insights from a broad corpus of environmental data, news, and social media to provide normalized green scores at the company level (e.g., IBM) and the index level (e.g., S&P 500). This historical database or corpus may be separate from or derived from the news/media corpus 110.
优选的是,公司的绿色分数或索引被接近实时地(例如大约150ms)计算,并且例如被用来发展针对投资的Alpha策略,监视公司的绿色声誉,以及以公司和行业级别标识改变风险概况。不像依赖于由分析师处理的周期性研究的其他方法,本发明接收并且连续处理除了传统源之外的媒体馈送,例如WWW web和社交媒体馈送。采用一种方式,本发明例如产生信息和数据流,所述信息和数据流捕获日常趋势以及允许用户(例如顾客)访问来自例如相关的和无关的产品(例如其他Thomson Reuters产品)的一系列内容的门户和智能警报的附加价值。随着绿色或环境相关的新闻和社交媒体内容增多,媒体服务公司可以利用例如Thomson Reuters Markets之类的跨广阔供应品平台的产品和服务。本发明使得公司能够把跨分区的供应品相联系,并且加速绿色分析法空间的市场占有率渗透。Preferably, the company's green score or index is calculated in near real time (e.g., approximately 150ms) and is used, for example, to develop an Alpha strategy for investment, monitor a company's green reputation, and identify changes in risk profiles at the company and industry level. Unlike other methods that rely on periodic research processed by analysts, the present invention receives and continuously processes media feeds other than traditional sources, such as WWW web and social media feeds. In one embodiment, the present invention generates information and data streams that capture daily trends and allow users (e.g., customers) to access portals and smart alerts for a range of content from, for example, related and unrelated products (e.g., other Thomson Reuters products). As green or environmentally related news and social media content increases, media service companies can utilize products and services across a broad supply platform, such as Thomson Reuters Markets. The present invention enables companies to connect supplies across divisions and accelerate market share penetration in the green analytics space.
例如,由NMAS 100的绿色评分模块124应用的绿色分数准则可以包括:产品或制造环境相关的合规性或认证;能量效率;促进环境管理工作、消费者保护、人权和多样性的公司实践。由NMAS 100应用的绿色分数准则还可以包括:用于在绿色技术、能量高效技术、替代燃料技术、可再生资源技术中涉及的业务/产品的正面属性或分数,以及用于在酒、烟草、赌博、武器和/或军事方面所涉及的业务的负面属性或分数。由SRI行业所认识到的关注领域可以被概括为环境、社会正义和企业治理(ESG)。尽管在绿色性和环境合规性方面进行了描述,但是本发明也可以被应用在基于社会目标和追求来创建健康的生活方式或者用于对公司进行评分的其他分类的方面。For example, the green score criteria applied by the green scoring module 124 of the NMAS 100 may include: compliance or certification related to the product or manufacturing environment; energy efficiency; and corporate practices that promote environmental stewardship, consumer protection, human rights, and diversity. The green score criteria applied by the NMAS 100 may also include: positive attributes or scores for businesses/products involved in green technology, energy-efficient technology, alternative fuel technology, and renewable resource technology, as well as negative attributes or scores for businesses involved in alcohol, tobacco, gambling, weapons, and/or the military. Areas of concern recognized by the SRI industry can be summarized as environment, social justice, and corporate governance (ESG). Although described in terms of greenness and environmental compliance, the present invention may also be applied to creating a healthy lifestyle based on social goals and aspirations, or other categories for scoring companies.
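One simple way to turn the positive and negative criteria listed above into a number is to count matching attributes on each side. This is a sketch under assumed conventions: the attribute labels, the equal weighting of criteria, and the function name are all hypothetical.

```python
# Attribute labels paraphrase the criteria in the text; the exact names are assumptions.
POSITIVE_CRITERIA = {
    "environmental_certification",
    "energy_efficiency",
    "environmental_stewardship",
    "green_technology",
    "alternative_fuel_technology",
    "renewable_resource_technology",
}
NEGATIVE_CRITERIA = {"alcohol", "tobacco", "gambling", "weapons", "military"}


def criteria_score(attributes):
    """Count positive attributes minus negative attributes for one company."""
    attrs = set(attributes)
    unknown = attrs - POSITIVE_CRITERIA - NEGATIVE_CRITERIA
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return len(attrs & POSITIVE_CRITERIA) - len(attrs & NEGATIVE_CRITERIA)


# Two positive attributes and one negative attribute.
example_score = criteria_score(["energy_efficiency", "green_technology", "tobacco"])
```

Equal weighting is the simplest choice; a real scoring module would likely weight criteria differently and combine this structural score with the sentiment-derived score.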
NMAS 100可以由在处理新闻/媒体数据和递送给其的内容方面利用语言学技术来处理的自然语言处理所激励。NMAS 100对公司相关的新闻/媒体评论进行分析,以随时间跟踪“绿色”情绪。由NMAS 100提供的定量“绿色”策略可以被用于做市中,用于资产组合管理中以通过对资产组合情绪定基准以及计算业界加权来改善资产分配决策,用于预报股票、业界和市场前景的基础分析中,用于风险管理中以更好地理解针对资产组合的异常风险以及发展潜在的情绪防护,并且以跟踪并对公众感知和媒体覆盖定基准以及对于竞争者也这样做。NMAS 100 can be powered by natural language processing that utilizes linguistic techniques to process news/media data and content delivered to it. NMAS 100 analyzes company-related news/media commentary to track "green" sentiment over time. The quantitative "green" strategies provided by NMAS 100 can be used in market making, in portfolio management to improve asset allocation decisions by benchmarking portfolio sentiment and calculating industry weightings, in fundamental analysis to forecast stock, industry, and market outlooks, in risk management to better understand unusual risks to the portfolio and develop potential sentiment protections, and to track and benchmark public perception and media coverage and to do the same for competitors.
NMAS 100可以自动分析新闻内容,并且接近实时地生成交易(例如买入/持有/卖出)信号和/或更新绿色评分和/或复合环境索引信息。如本文所使用的,术语“接近实时”意味着在一秒内。然而,结合NMAS使用的数据的范围越广,响应时间就可能越长。为了缩短响应时间,可以考虑数据/内容的较小窗口/数量。此外,NMAS可以被配置成保持滚动数据集,以使得其仅仅对现有评分和报告进行更新,并且在任何给定时刻仅仅基于来自任意源的新发现、接收或发布的内容进行处理(“阅读”以及评分和预测)。NMAS接近实时地扫描和分析关于数千个公司的新闻和社交媒体内容,并且将结果馈送到定量策略和预测性模型中。NMAS输出可以被用来激励跨市场、资产分类和所有交易频率的定量策略,支持人为决策制定,并且有助于风险管理以及投资和资产分配决策。NMAS 100 can automatically analyze news content and generate trading (e.g., buy/hold/sell) signals and/or update green scores and/or composite environmental index information in near real time. As used herein, the term "near real time" means within one second. However, the broader the scope of data used in conjunction with the NMAS, the longer the response time may be. To shorten response time, a smaller window/amount of data/content can be considered. Furthermore, NMAS can be configured to maintain a rolling dataset so that it only updates existing scores and reports and, at any given moment, only processes ("reads" and scores and predicts) based on newly discovered, received, or published content from any source. NMAS scans and analyzes news and social media content about thousands of companies in near real time and feeds the results into quantitative strategies and predictive models. NMAS output can be used to power quantitative strategies across markets, asset classes, and all trading frequencies, support human decision making, and assist in risk management, investment, and asset allocation decisions.
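The rolling-dataset idea, where only newly received content is processed and the existing score is updated rather than recomputed from the full corpus, can be sketched with a fixed-size window. The class name, the window size, and the use of a plain average are assumptions for illustration.

```python
from collections import deque


class RollingGreenScore:
    """Keep a score over the most recent `window` items, skipping duplicates."""

    def __init__(self, window=3):
        self._items = deque(maxlen=window)  # oldest item drops out automatically
        self._seen = set()                  # ids already processed (grows unbounded here)

    def add(self, item_id, sentiment):
        """Process one item; return False if it was already seen."""
        if item_id in self._seen:
            return False
        self._seen.add(item_id)
        self._items.append(sentiment)
        return True

    @property
    def score(self):
        """Average sentiment over the current window, or 0.0 if empty."""
        return sum(self._items) / len(self._items) if self._items else 0.0


tracker = RollingGreenScore(window=3)
tracker.add("a", +1)
tracker.add("b", -1)
duplicate_accepted = tracker.add("a", +1)  # repeated item is ignored
tracker.add("c", +1)
tracker.add("d", +1)                       # pushes item "a" out of the window
current_score = tracker.score
```

A smaller window shortens processing time at the cost of a noisier score, which mirrors the trade-off between data breadth and response time noted above.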
Content may be received as input to NMAS 100 in any of a variety of ways and forms, and the present invention does not depend on the nature of the input. Depending on the source of the information, the NMAS applies various techniques to collect information relevant to the green score. For example, if the source is internal or is otherwise in a format recognized by the NMAS, content related to a particular company, industry, or index may be identified from fields or tags in the document or in metadata associated with the document. If the source is external or is otherwise not in a format readily understood by the NMAS, natural language processing and other linguistic techniques may be employed to identify the companies mentioned in the text and the companies to which statements relate. Additional such techniques may be used to identify text terms of potentially heightened relevance, for example by scoring text across the following exemplary primary dimensions: "author sentiment," a measure of how positive, negative, or neutral the tone of the item is, specific to each company in the article; "relevance," how relevant or substantive the story is for a particular item; "volume analysis," how much news is occurring about a particular company; "uniqueness," how new or repetitive the item is over various time periods; and "headline analysis," which denotes, among other things, special features such as broker actions, pricing commentary, interviews, exclusives, and composite reports. The NMAS uses rich metadata, for example: company identifiers; topic codes identifying the subject matter; stage of the story (alert, article, update, etc.); business sector and geographic classification codes; and index references to similar articles. Metadata across multiple fields provides differentiated content for use by quantitative analysts and sophisticated algorithmic engines.
The NMAS may utilize a wide variety of text scores and metadata types. The following are exemplary types for use with the present invention: item type (alert, article, update, correction); item genre, the classification of the story (interview, exclusive, composite report, etc.); headline, the alert or headline text; relevance (0 to 1.0); prevailing sentiment (1, 0, -1); positive, neutral, and negative, which provide a more detailed indication of sentiment; location of first mention, the sentence position at which the item is first mentioned; total number of sentences, used for article length; number of companies, how many companies are tagged to the item; number of words/tokens, how many words/tokens concern the company; total words/tokens, the total number of words/tokens in the news item; broker action, denoting the broker action (upgrade, downgrade, maintain, undefined) or whether the item concerns the broker itself; price/market commentary, used to flag items that describe pricing/market commentary; item count, how many items have been published about a company over various time periods; link count, denoting the degree of repetition over periods from 12 hours to 7 days; topic codes, which describe what the story is about (e.g., RCH = research; RES = results; RESF = results forecast; MRG = mergers and acquisitions); other companies, which other companies are tagged to the article; and other metadata, such as index ID, link references, and story chains.
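By way of illustration only, a subset of the score and metadata types listed above could be carried in a record such as the following sketch. The field names, the example identifier `ABC.N`, and the treatment of the positive/neutral/negative values as fractions are assumptions made for the example, not part of the described system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScoredItem:
    """Hypothetical per-company record for one news item."""
    item_type: str               # "alert", "article", "update" or "correction"
    headline: str                # alert or headline text
    company_id: str              # company identifier, e.g. a RIC
    relevance: float             # 0 to 1.0
    sentiment: int               # prevailing sentiment: 1, 0 or -1
    positive: float              # assumed here to be fractions giving finer detail
    neutral: float
    negative: float
    first_mention_sentence: int  # sentence position of the first mention
    total_sentences: int         # used for article length
    topic_codes: List[str] = field(default_factory=list)  # e.g. "RES", "MRG"

    def __post_init__(self) -> None:
        # Enforce the value ranges stated in the description.
        if not 0.0 <= self.relevance <= 1.0:
            raise ValueError("relevance must be in [0, 1]")
        if self.sentiment not in (-1, 0, 1):
            raise ValueError("sentiment must be -1, 0 or 1")


item = ScoredItem(
    item_type="article",
    headline="ABC opens renewable plant",
    company_id="ABC.N",  # hypothetical identifier
    relevance=0.8,
    sentiment=1,
    positive=0.7,
    neutral=0.2,
    negative=0.1,
    first_mention_sentence=1,
    total_sentences=12,
    topic_codes=["RES"],
)
print(item.company_id, item.sentiment, item.relevance)
```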
Figures 1-4 illustrate exemplary structural components and frameworks for carrying out the present invention and for providing an effective interface for user interaction with such computer- and database-based systems. The following is a more detailed description of implementations of the processes and features of the present invention, including a discussion of low-frequency work on news sentiment and general exploratory data analysis of equities (including volatility and direction) and commodities. In an exemplary scenario, which is not intended to limit the present invention but merely to aid in illustration, the following describes how news metadata relates to price and discusses the short-term relationship between news and price. The exemplary discussion examines four equity markets (United States, United Kingdom, Japan, and Hong Kong) and four commodities (crude oil, petroleum products, precious metals, and grains). Exemplary forecasting models and frameworks are discussed below, including a description of an exemplary engine for consuming news and making asset price forecasts. Performance is examined with the goal of making short-term predictions of returns, trading volume, and volatility.
The NMAS may be implemented in a variety of deployments and architectures. For example, within the context of a corporate structure, NMAS data may be delivered via one or more web-hosted solutions or central servers, or through a dedicated service (e.g., an index feed), as a solution deployed at a customer or client site. Figure 1 shows an exemplary news/media analysis system (NMAS) 100, which includes an online information retrieval system adapted to integrate with either or both of a central service provider system and a client-operated processing system. In this exemplary embodiment, NMAS system 100 includes at least one web server that can automatically control one or more aspects of an application on a client access device, which may run an application augmented with an add-on framework that integrates into a graphical user interface or browser control to facilitate interfacing with one or more web-based applications. System 100 includes one or more databases 110, one or more servers 120, and one or more access (e.g., client) devices 130.
News/media database 110 includes a set of primary (internal) databases 112, a set of secondary (external) databases 114, and a metadata module 116. In the exemplary embodiment, internal databases 112 include a news service or database 1121 (in this case represented by the exemplary Thomson Reuters TR News) and a feed service or database(s) 1122 (in this case represented by the exemplary Thomson Reuters TR News Feed). The internal component of news/media database 110 may also include internally originated social media content. External databases 114 include non-internal news services or database(s) 1141, blog databases 1142, social media databases 1143, and other content database(s) 1144. Metadata module 116 includes components adapted to identify, extract, apply, or otherwise recognize metadata associated with news stories and/or social media content. Such metadata may be used by NMAS 100 to pre-process news stories (for example, sentence splitting, part-of-speech tagging, text parsing, and tokenization) to facilitate associating stories with one or more companies and to prepare the content for the application of computational linguistics processes and sentiment analysis.
Databases 110, which take the exemplary form of one or more electronic, magnetic, or optical data storage devices, include or are otherwise associated with respective indices (not shown). Each index includes terms and phrases in association with corresponding document addresses, identifiers, and other conventional information. Databases 110 are coupled or couplable to server 120 via a wireless or wireline communications network, such as a local area network, wide area network, private network, or virtual private network.
Server 120, which is generally representative of one or more servers for serving data in the form of web pages or other markup language forms (with associated applets, ActiveX controls, remote-invocation objects, or other related software and data structures), is configured to serve clients of various "thicknesses." More particularly, server 120 includes a processor module 121; a memory module 122, which includes a subscriber database 123, a green score/composite index module 124, a sentiment processing module 125, and a user interface module 126; a training/learning module 127; and a classifier module 128. Processor module 121 includes one or more local or distributed processors, controllers, or virtual machines. Memory module 122, which takes the exemplary form of one or more electronic, magnetic, or optical data storage devices, stores subscriber database 123, green score/composite index module 124 (such as for company-related predictive analytics based on the predictive modeling of the present invention), sentiment processing module 125 (such as for other financial services available to users for further researching companies of interest), and user interface module 126.
Subscriber database 123 includes subscriber-related data for controlling, administering, and managing pay-as-you-go or subscription-based access to databases 110. In the exemplary embodiment, subscriber database 123 includes one or more user preference (or, more generally, user) data structures 1231, including user identification data 1231A, user subscription data 1231B, and user preferences 1231C, and may further include user-stored data 1231E. In the exemplary embodiment, one or more aspects of the user data structures relate to user customization of various search and interface options. For example, user ID 1231A may include user login and screen name information associated with a user having a subscription to the green score and/or environmental composite index services distributed via NMAS 100. Green score/composite index module 124 includes software and functionality for handling the functions described above and may be applied against one or more databases 110, for example in conjunction with one or more of sentiment processing module 125, training module 127, and classifier module 128, to generate or update a green score for a company, or to generate or update a composite index made up of a set of stocks, based on data received from the databases or corpus 110. For example, a training data set or initial data set from databases 110, applied with some form of validation, may be used to train or validate the performance of NMAS 100 for use on an ongoing basis, such as in a fee-based service offered by an FSP.
Information integration tool (IIT) framework or interface module 126 (or software framework or platform) includes machine-readable and/or executable instruction sets for wholly or partly defining software and related user interfaces, one or more portions of which integrate or cooperate with one or more applications. As shown in Figure 2, the NMAS includes a news/social media processing engine (NSMPE) that cooperates with IIT 126 and metadata module 116. The NSMPE includes, or may cooperate with, one or more search engines for receiving and processing against metadata and for aggregating, scoring, filtering, recommending, and presenting results. In the exemplary embodiment, the NSMPE includes one or more feature engines 206, a predictive modeling module 207, a learning or training engine or module 208, and a green score/composite index module 209 to implement the functionality described herein.
Referring to Figure 1, access device 130 (such as a client device) is generally representative of one or more access devices. In the exemplary embodiment, access device 130 takes the form of a personal computer, workstation, personal digital assistant, mobile telephone, or any other device capable of providing an effective user interface with a server or database. Specifically, access device 130 includes a processor module 131, memory 132, a display 133, a keyboard 134, and a graphical pointer or selector 135. Processor module 131 includes one or more processors, processing circuits, or controllers and, in the exemplary embodiment, takes any convenient or desirable form. Coupled to processor module 131 is memory 132. Memory 132 stores code (machine-readable or executable instructions) for an operating system 136, a browser 137, and document processing software 138. In the exemplary embodiment, operating system 136 takes the form of a version of the Microsoft Windows operating system, and browser 137 takes the form of a version of Microsoft Internet Explorer. Operating system 136 and browser 137 not only receive inputs from keyboard 134 and selector 135, but also support rendering of graphical user interfaces on display 133. Upon launch of the document processing software, an integrated information retrieval graphical user interface 139 is defined in memory 132 and rendered on display 133. Upon rendering, interface 139 presents data in association with one or more interactive control features (or user interface elements).
In one embodiment of operating a system using the present invention, an add-on framework is installed and one or more tools or APIs on server 120 are loaded onto one or more client devices 130. In the exemplary embodiment, this entails a user directing a browser in a client access device, such as access device 130, to the Internet Protocol (IP) address for an online information retrieval system (such as offerings from Thomson Reuters Financial and other systems), and then logging on to the system using a username and/or password. A successful login results in a web-based interface being output from server 120, stored in memory 132, and displayed by client access device 130. The interface includes an option for initiating download of information integration software with corresponding toolbar plug-ins for one or more applications. If the download option is initiated, download administration software ensures that the client access device is compatible with the information integration software and detects which document processing applications on the access device are compatible with it. With the user's approval, the appropriate software is downloaded and installed on the client device. In one alternative, an intermediate "enterprise" network server may receive one or more of the framework, tools, APIs, and add-on software for loading onto one or more client devices 130 using internal processes.
Once installed in whatever fashion, the user may then be presented with an online tools interface in context with a document processing application. Add-on software for one or more applications may be invoked simultaneously. An add-on menu includes a listing of web services or applications and/or locally hosted tools or services. The user makes a selection via the tools interface, such as manually via a pointing device. Once a selection is made, the selected tool, or more precisely its associated instructions, is executed. In the exemplary embodiment, this entails communicating with corresponding instructions or web applications on server 120, which in turn may provide dynamic scripting and control of the host word processing application using one or more APIs stored on the host application as part of the add-on framework.
Figure 2 illustrates another representation of an exemplary NMAS system 200 for carrying out the processes described herein, which are performed using a combination of hardware, software, and communications networking. In this example, NMAS 200 provides a framework for searching, retrieving, analyzing, and ranking. NMAS 200 may be used in conjunction with a system 204 of an information or professional financial services provider (FSP), such as Thomson Reuters Financial, and includes the information integration and tools framework and applications module 126 described above. Further, in this example, system 200 includes a central network server/database facility 201 comprising a network server 202, a database 203 of documents and information from internal and/or external sources (e.g., news stories, blogs, social media), an information/document retrieval system 205 (having as components a feature building module 206, a predictive module 207, and a training or learning module 208), and a news/social media processing engine that includes a green score/composite index engine 209. Central facility 201 may be accessed by remote users 210, such as via a network 226 (e.g., the Internet). Aspects of system 200 may be implemented using any combination of Internet- or (World Wide) Web-based, desktop-based, or application Web-enabled components. Remote user system 210 in this example includes a GUI interface operated via a computer 211, such as a PC, which may comprise a typical combination of hardware and software including, as shown with respect to computer 211, system memory 212, operating system 214, application programs 216, graphical user interface (GUI) 218, processor 220, and storage 222. Storage 222 may contain electronic information 224, such as electronic documents and information, for example green score data streams and/or reports, and company- and/or sector-based environmental composite index data streams and/or related reports and information. The methods and systems of the present invention, described in detail below, may be used to provide remote users, such as investors, with access to a searchable database. In particular, remote users may search the database using search queries based on company RICs, green certification lists (as described elsewhere herein), stock or other names to retrieve and view predictive analytics and/or suggested actions, as discussed below. RIC refers to Reuters Instrument Code, a ticker-like code used to identify financial instruments and indices; RICs are used for looking up information on various financial information networks, such as Thomson Reuters market data platforms (e.g., Bridge, Triarch, TIB, and the Reuters Market Data System (RMDS) open data integration platform). A green certification list may take the form of a "Green RIC" or the like. Client-side application software may be stored on a machine-readable medium and comprise instructions executed, for example, by processor 220 of computer 211. The presentation of web-based interface screens facilitates interaction between user system 210 and central system 201, such as tools for further analyzing data streams and other data and reports received via network 226 and stored locally or accessed remotely. Operating system 214 should be suitable for use with the system 201 and browser functionality described herein, for example Microsoft Windows Vista (Business, Enterprise, and Ultimate editions), Windows 7, or Windows XP Professional with appropriate service packs. The system may require the remote user or client machines to be compatible with minimum threshold levels of processing capability (e.g., Intel Pentium III), speed (e.g., 500 MHz), memory, and other parameters.
The configurations thus described are some of many and do not limit the present invention. Central system 201 may include a network of servers, computers, and databases, connected for example through a LAN, WLAN, Ethernet, token ring, FDDI ring, or other communications network infrastructure. Any of several suitable communication links are available, such as one or a combination of wireless, LAN, WLAN, ISDN, X.25, DSL, and ATM-type networks. Software to perform the functions associated with system 201 may include self-contained applications within a desktop, server, or network environment, and may utilize local databases (such as SQL 2005 or above, SQL Express, IBM DB2, or another suitable database) to store documents, collections, and data associated with processing such information. In the exemplary embodiments, the various databases may be relational databases. In the case of relational databases, various tables of data are created, and data is inserted into and/or selected from these tables using SQL or some other database query language known in the art. In the case of a database using tables and SQL, a database application such as MySQL™, SQLServer™, Oracle 8i™, 10g™, or some other suitable database application may be used to manage the data. These tables may be organized into a relational data schema (RDS) or object-relational data schema (ORDS), as is known in the art.
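The table-based storage described above can be sketched as follows. The sketch uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for the database applications named above; the table name `green_score`, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_score ("
    " company_id TEXT NOT NULL,"
    " scored_at TEXT NOT NULL,"
    " score REAL NOT NULL)"
)

# Insert illustrative (made-up) scores using parameterized SQL.
rows = [
    ("ABC.N", "2015-01-01T00:00:00", 0.25),
    ("ABC.N", "2015-01-02T00:00:00", 0.40),
    ("XYZ.L", "2015-01-02T00:00:00", -0.10),
]
conn.executemany("INSERT INTO green_score VALUES (?, ?, ?)", rows)

# Select the most recent score for each company.
latest = conn.execute(
    "SELECT g1.company_id, g1.score FROM green_score AS g1"
    " WHERE g1.scored_at = ("
    "   SELECT MAX(g2.scored_at) FROM green_score AS g2"
    "   WHERE g2.company_id = g1.company_id)"
    " ORDER BY g1.company_id"
).fetchall()
print(latest)
```

ISO 8601 timestamps stored as text sort chronologically, which is why `MAX(scored_at)` picks the latest row in this sketch.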
In one exemplary method of the present invention, and with reference to the flow of Figure 3, the following processing is performed. First, at step 302, a user obtains information and content of interest from suitable news/social media sources (news feeds, blogs, websites, etc.), whether internal or external. At step 304, the system applies pre-processing to the obtained information to identify embedded metadata or other descriptors and to process text, words, phrases, and attributes for relevance to one or more companies. At step 306, the system applies sentiment analysis and arrives at one or more sentiment scores associated with the obtained and processed information as it relates to the companies of interest identified therein. At step 308, the system may optionally (as discussed elsewhere herein) apply a risk taxonomy to arrive at separate or derived scores or indications related to the green score or composite index. At step 310, the system applies a predictive model that uses the sentiment scores to arrive at a green score, for example to arrive at a predicted condition or price behavior associated with each company. At step 312, for a set of companies each having a green score, the system generates an expression of a composite index of the set of green scores; for example, the index may represent the predicted behavior of the corresponding set of stock prices and/or a suggested action (e.g., buy, sell, or hold) to take in light of the predicted behavior.
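The flow of steps 302 through 312 can be sketched end to end as follows. This is a toy illustration under stated assumptions, not the described implementation: the word lists, the use of a plain mean in place of the predictive model of step 310, the equal-weighted composite of step 312, and the company identifiers are all hypothetical, and the optional risk taxonomy of step 308 is omitted.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical sentiment lexicon, used only for this sketch.
POSITIVE = {"clean", "renewable", "sustainable"}
NEGATIVE = {"spill", "pollution", "fine"}


def preprocess(text: str) -> List[str]:
    """Step 304 (toy): lower-case, split on whitespace, strip punctuation."""
    return [tok.strip(".,;:!?") for tok in text.lower().split()]


def sentiment(tokens: List[str]) -> int:
    """Step 306 (toy): lexicon balance mapped to 1, 0 or -1."""
    balance = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return (balance > 0) - (balance < 0)


def green_score(sentiments: List[int]) -> float:
    """Step 310 (toy stand-in for a predictive model): mean item sentiment."""
    return mean(sentiments) if sentiments else 0.0


def composite_index(scores: Dict[str, float]) -> float:
    """Step 312 (toy): equal-weighted average of company green scores."""
    return mean(list(scores.values()))


# Step 302: content already obtained and associated with companies (made-up items).
items = {
    "ABC.N": ["ABC opens renewable plant.", "ABC praised for clean operations."],
    "XYZ.L": ["XYZ faces fine after oil spill.", "XYZ launches sustainable packaging."],
}

scores = {
    company: green_score([sentiment(preprocess(text)) for text in texts])
    for company, texts in items.items()
}
index = composite_index(scores)
print(scores, index)
```

Each stage is a separate function so that a more capable component, such as a trained classifier for step 306, could replace the toy version without changing the surrounding flow.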
Figure 4 is a flow diagram illustrating database and document processing, sentiment scoring, and green scoring, with the predictive modeling aspects of the present invention serving as inputs to and outputs of a system employing the present invention, such as the method of Figure 3. For example, external documents, news, social media, and other information (such as news articles from traditional and new media sources, blogs, and social media) are treated as inputs to a news/social media processing engine such as that described above, which may include a combined or separate external message engine and internal data feed message engine. Internal news feeds and the like (e.g., TR Feeds, Reuters News, Westlaw, curated feeds) are processed by an internal data feed document processing module. The combined news feeds are further processed by a sentiment scoring engine and ultimately processed in accordance with a predictive model to output a green score for a company and/or a composite index related to the environmental performance or certification of a set of companies. In this manner, the present invention provides predictive analytics for the respective companies or other outputs such as suggested actions (buy, sell, or hold). Another output may take the form of a data stream or feed related to the green score or composite index, which may be delivered to subscribers of a financial service and further processed locally. Yet another output may be a smart alert service. In addition, a desktop add-on may include means to display the various outputs and/or to receive inputs in response thereto.
Information-based companies have made many efforts to collect and/or analyze large corpora or collections of documents and information, including traditional and new-age media, blogs, web pages, and the like. For example, web crawlers and screen scrapers have been used to extract available information and data for subsequent processing and analysis, such as formatting or reformatting of structured and unstructured data. Companies may use this information to create or improve the image or identity of a business or product in the minds of customers, which is increasingly important in the context of corporate social responsibility (CSR) and environmental responsibility. Systems capable of discerning from information (e.g., text) any underlying "sentiment" or "opinion" conveyed by an expression are highly useful in forming predictive models. This is often referred to as sentiment or opinion mining, and is also known as "feeling" or "emotional" computing. These techniques often use natural language processing and are designed to recognize and interpret human sentiment (opinions, emotions, or feelings, such as happy, sad, fearful, important, unimportant, positive, negative) and to generate responses based on the detected human emotions or feelings.
More specifically, semantic analysis interprets text to discern expressions of emotion or opinion and can be used to generate semantically aware results. Such systems can be based on ontologies (e.g., an ontology of human emotions) and linguistic resources (e.g., WordNet-Affect (WNA)). By extending the use of such systems beyond traditional news sources, the NMAS can employ these techniques to interpret and process opinions and sentiment expressed in non-traditional channels and sources (such as blogs, wikis, online forums, message boards, chat rooms, social media networks, and the like) in order to determine green sentiment and green scores. For all media sources, and particularly for "new media" sources that lack historically validated internal processes, the system can also assign a level of verification regarding the (actual or perceived (short-term)) accuracy of a message. Furthermore, the system can be configured to identify "fake" news and to anticipate the short-term effects of such "news" when predicting stock price behavior.
By way of example, the sentiment scoring functionality described herein can be performed by the Reuters NewsScope Sentiment Engine (RNSE). The RNSE enables clients to use a unique set of news/social media sentiment, relevance, and novelty indicators for algorithmic trading systems as well as for risk management and human decision support processes. The service uses a linguistic model that scores sentiment within milliseconds for news/social media concerning the 40 commodity and energy assets and more than 10,000 companies supported in the current offering. Algorithmic trading is useful for both buy-side and sell-side market participants in cash equity markets and in other liquid asset classes such as foreign exchange, commodities, and energy. Commodity markets offer institutional investors and proprietary traders numerous opportunities to grow and diversify their investment strategies. Given the growth and price volatility of global commodity and energy markets, and the increasing adoption of this asset class in active trading strategies, customer demand for relevant quantitative solutions continues to grow. The sentiment scores and the resulting green scores or composite indexes can be used by trading desks and quantitative research analysts to better model asset price movements. Clients have access to historical data, which allows them to backtest the suitability of the system for their trading and investment strategies.
FIG. 5 is a flowchart representing the steps in an exemplary method for generating sentiment for use in green scoring, for example for benchmarking the greenness of public and private companies using social media and news content. Exemplary data sources for processing by the NMAS 100 include: news agency wire sources (e.g., AFP, AP, TR, Reuters, Bloomberg), social media (blogs, Twitter, RSS, Gigaom, NWCleanTech, ClimateWire), and Internet/Web-based sources (e.g., CNN.com, WSJ.com, lesoir.be). In today's environment, social media often provides a more timely source of information than traditional news channels. For example, a blogger may post a comment about "Company A", and that comment and further commentary may be noticed on social media sources before eventually being picked up by syndicated and traditional news reports and sources. This appears to be particularly true of "green" issues and content. By examining social media-based sentiment, the present invention can respond more quickly in predicting corporate behavior and stock prices with respect to green issues. In the example of FIG. 5, the following analyses are performed: entity extraction (e.g., object, company, location, etc.), source, author, news volume, relevance to a particular taxonomy or theme (e.g., green), fact extraction, topic code assignment, classification assignment, tone analysis, sentiment assignment (+ or -), and instrument code assignment (e.g., RIC, green RIC). The output resulting from analyzing the source data can be delivered in any of the following forms: a real-time stream (and historical database) of sentiment/scores for a given company under a given taxonomy; a real-time stream (and historical database) of sentiment/scores for more than one company, representing a composite index; an alert service in the form of an electronic message indicating that the index for a certain company has moved by more than a preset percentage within a given time period; and/or an alert service in the form of an electronic message indicating that the index for a certain company has moved by more than a percentage preset by the user or system within a time period preset by the user or system. The recipient of the deliverable output can then further process the output as desired.
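The percentage-based alert output described above can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions: the names `ScorePoint` and `threshold_alert`, the message format, and the choice of the oldest in-window score as the baseline are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ScorePoint:
    """One timestamped score for a company (hypothetical record type)."""
    timestamp: datetime
    score: float


def threshold_alert(history: List[ScorePoint],
                    window: timedelta,
                    preset_pct: float) -> Optional[str]:
    """Return an alert message if the score moved by more than preset_pct
    percent within the window ending at the latest point, else None.

    Assumes `history` is sorted by timestamp in ascending order.
    """
    if not history:
        return None
    latest = history[-1]
    cutoff = latest.timestamp - window
    # Baseline: the oldest point that still falls inside the window.
    in_window = [p for p in history if p.timestamp >= cutoff]
    baseline = in_window[0]
    if baseline.score == 0:
        return None  # percentage change is undefined for a zero baseline
    change_pct = (latest.score - baseline.score) / abs(baseline.score) * 100.0
    if abs(change_pct) > preset_pct:
        return f"ALERT: score moved {change_pct:+.1f}% within the window"
    return None
```

Both the window and the percentage are parameters, which corresponds to the variant in which the user or system presets both values.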
FIG. 6 is a diagram representing a green community in the form of a website. The community can include access to and use of existing resources and tools. For example, the community includes aggregation assets, analytics and tool assets, and distribution assets to provide a robust and efficient experience for users (such as investors and those in the investment community). In this example, the aggregation assets include: news; StarMine; legal entities; GRID; NOVUS; social media; websites; crowdsourcing software; and Moreover/InfoEngine. The analytics assets may include: a news sentiment engine; OpenCalais; Lipper benchmarks; Velocity Analytics; machine learning tools; green sentiment; a green taxonomy; broad text analytics (Lexalytics); and alerts (Psydex). The distribution assets may include: Eikon/Omaha; DataScope; Elektron; an enterprise services portal; a content marketplace; IDN/RIC/RFA; Reuters.com blogs; news archives; and one or more green websites and blog communities.
Using the NMAS 100 system and related techniques described herein, the present invention addresses a broad set of needs by providing intelligent information and analytical tools to monitor and predict the impact of green behavior at the company and index level. The present invention can be used to access a historical database of green news tagged to individual companies, track real-time alerts for major news with associated green scores, monitor social media sources and track green initiatives or events, publish and receive green sentiment scores for different companies, and use community tools to monitor peer behavior. Green asset managers can use the present invention to implement and monitor adherence to green investment goals and requirements and to identify alpha generation strategies. Enterprises can use the present invention in a more inward-directed manner for brand monitoring and for implementing and evaluating CSR and other related initiatives. Regulatory agencies (e.g., the Environmental Protection Agency) can use the present invention to monitor and supervise green compliance and as input into green legislation.
Referring now to FIG. 7, and with respect to the green sentiment composite index aspect of the present invention, the NMAS 100 may have at its core a combination of machine learning and artificial intelligence (AI) capabilities that provides intelligent information for use in analyzing the impact of the green behavior of public and private companies. The resulting outputs of the NMAS 100 may take the form of green sentiment company and composite indexes, intelligent alerts, and/or a desktop client or interface and toolset. The NMAS 100 may use highly specialized taxonomies designed specifically for scoring environmental topics relevant to companies and industries. Each source will have its own nuanced taxonomy and weighting for index calculation (e.g., performed by Velocity Analytics). Once in operation, the AI can adapt to changing market conditions, extend the taxonomy to include newly developed lingo, and highlight the text patterns most relevant to equity price movements. In implementation, the present invention can provide a classification for green investments, green alerts in the SEC can be triggered, investors can trade based on green RICs or classifications, a social media component is added to the overall green investment community, and green data feeds can be delivered for further processing by investors.
Services such as InfoEngine provide out-of-the-box aggregation of Twitter, blogs, online news feeds, and other types of third-party content. The components involved include, for example, content aggregators such as InfoEngine, computation engines such as Lexalytics, and community websites. Once content is fed into the server, OpenCalais/ClearForest, for example, will be used for intelligent tagging, which helps to distinguish between feeds. Once the taxonomy and the corresponding algorithms have been applied, a computation engine (such as Lexalytics) will then score the articles.
Sentiment scores from different sources will be weighted based on their importance. Widely circulated online and newswire sources will be weighted based on their Alexa and Nielsen ratings, while social media sources will be weighted based on their followers, subscribers, and impressions. The weighted scores will then be aggregated to provide an overall "green sentiment". Similar to the evolution of the taxonomy, the weights can change as the AI detects a higher correlation between a source and a company's equity price. Finally, a community website will be built to promote green social media debate and will be used to maintain the green taxonomy.
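The weighting and aggregation step above can be sketched as a weighted mean. This is a minimal sketch under stated assumptions: the text does not specify how ratings or follower counts are converted into weights, so the weights here are simply given as numbers, and the function name `aggregate_green_sentiment` is hypothetical.

```python
from typing import Iterable, Tuple


def aggregate_green_sentiment(scored_items: Iterable[Tuple[float, float]]) -> float:
    """Combine (sentiment, weight) pairs into one weighted-mean score.

    `sentiment` is a per-source score; `weight` stands in for source
    importance (e.g., derived from ratings or follower counts; the
    derivation itself is outside this sketch).
    """
    items = list(scored_items)  # materialize so the data can be read twice
    total_weight = sum(weight for _, weight in items)
    if total_weight == 0:
        return 0.0  # no usable sources; neutral score by convention here
    return sum(sentiment * weight for sentiment, weight in items) / total_weight


# Example: a heavily weighted newswire item and a lightly weighted blog post.
overall = aggregate_green_sentiment([(0.8, 3.0), (-0.2, 1.0)])
print(round(overall, 2))  # approximately 0.55
```

A weighted mean keeps the aggregate on the same scale as the individual scores, so the weights can be adjusted later without rescaling downstream thresholds.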
Risk Mining
FIGS. 8-16 are examples of risk mining techniques for implementing the present invention. The risk mining techniques for use in conjunction with the present invention are described more fully below.
FIG. 8 illustrates how risks materialize over time. Initially, a risk P=>Q is extracted from a large text database, where Q represents a high-impact event and P represents a precondition of Q that is causally or statistically linked to Q and precedes Q in time. Unless otherwise stated or indicated herein, the implication symbol "=>" captures the causal and/or enabling relationship that exists between P and Q (e.g., P causes Q, or P may enable Q). The implication symbol "=>" does not denote material implication. Later, at time t_j, P may occur, which in turn may lead to Q occurring at time t_k. The present invention addresses the problem of automatically obtaining the risk P=>Q from text, and describes how P=>Q and P can be used to alert a user that Q may be imminent. As used herein, the term "risk", which may be positive or negative, refers to an event involving uncertainty (unless the event has already occurred) that may be caused by some factor, thing, element, or process. In particular, the term "risk" as used herein refers to a precondition for an event, where the precondition is causally or statistically linked to the event and precedes the event in time. As used herein, the term "precondition" refers to a statement or indication related to a particular object. In particular, the term "precondition" refers to a statement or indication that is related to a particular event, either directly or through the mining techniques of the present invention.
A corpus (e.g., one or more sets of one or more text feeds) is mined for risks using a computing device. As used herein, the term "corpus" and its variants refer to one or more data sets, particularly digital data that includes text data. The corpus may include, but is not limited to: news; financial information, including but not limited to stock price data and its standard deviation (volatility); government and regulatory reports, including but not limited to government agency reports and regulatory filings such as tax filings, medical filings, legal filings, Food and Drug Administration (FDA) filings, and Securities and Exchange Commission (SEC) filings; private entity publications, including but not limited to annual reports, newsletters, advertisements, and press releases; blogs; web pages; event streams; protocol files; status updates on social networking services; emails; Short Message Service (SMS) messages; instant chat messages; Twitter tweets; and/or combinations thereof. The computing device examines the corpus to extract risk-indicating patterns, and uses risk-indicating seed patterns as seeds for a risk identification algorithm for subsequent risk mining by an analyst or user. The computing device may also include an interface (such as a keyboard) for querying the computer and a display for showing results from the computer.
The computing device can also be used to alert a user, through a computer interface (not shown), to risks, including but not limited to imminent risks, i.e., risks that are likely to occur, for example in the near future or within a defined time period. The user is typically alerted via a computing device (not shown). The present invention is not limited to this, however, and any device with a visual display or even voice communication may suitably be used. As used herein, the term "computing device" refers to a device that computes, particularly a programmable electronic machine that performs high-speed mathematical or logical operations or that assembles, stores, correlates, or otherwise processes information. Examples include, without limitation, mainframe computers, personal computers, and handheld devices. Before mining a corpus for risks, the present invention uses the computing device to extract risk-indicating patterns from one or more corpora of text data. As used herein, a risk-indicating pattern is a pattern, developed through the techniques of the present invention, that relates a possible precondition to a possible event.
The computing device contains a risk identification algorithm. Using the computing device containing the risk identification algorithm, a risk miner searches a corpus of text data for instances of the supplied set of risk-indicating seed patterns in order to create a risk database. The corpus may include, but is not limited to: news; financial information, including but not limited to stock price data and its standard deviation (volatility); government and regulatory reports, including but not limited to government agency reports and regulatory filings such as tax filings, medical filings, legal filings, Food and Drug Administration (FDA) filings, and Securities and Exchange Commission (SEC) filings; private entity publications, including but not limited to annual reports, newsletters, advertisements, and press releases; blogs; web pages; event streams; protocol files; status updates on social networking services; emails; Short Message Service (SMS) messages; instant chat messages; Twitter tweets; and/or combinations thereof. Corpus 210 may be the same as or different from corpus 110.
In one embodiment of the present invention, the risk database is generated using trigger keywords (e.g., "risk", "threat"). In another embodiment, the risk database is generated using regular expressions (e.g., "("may")?pose(s)?(a)?threat(s)?to"). Candidate risk sentences or sentence sequences are created, and new patterns are generalized by running a named entity tagger, or a part-of-speech (POS) tagger and a chunker, over them (entities may be described by proper nouns or noun phrases (NPs), not only by named entities) and replacing each entity with a placeholder for its category (e.g., "J.P. Morgan" => "<COMPANY>"). The generated patterns can be used to reprocess the corpus, after some human review in one embodiment of the present invention, or automatically in another embodiment. The extracted sentences or sentence sequences are then both validated (is this really a risk-indicating sentence?) and parsed into a risk of the form P=>Q (i.e., finding which text spans correspond to the precondition "P", which parts express the implication "=>", and which parts express the high-impact event "Q"). This is done using, without limitation, the following features: a set of terms that have a significant statistical association with the term "risk" (in one embodiment of the invention, statistical procedures such as pointwise mutual information (PMI) and log-likelihood, or rules including but not limited to rules obtained by Hearst pattern induction, are used to determine the term set); a set of binary gazetteer features, where a feature fires if a term is found in a gazetteer of risk-indicating terms ("threat", "bankruptcy", "risk", ...) that was compiled by human experts or extracted from manually labeled training data; a set of indicators of speculative language; instances of future time references; the occurrence of conditionals; and/or the occurrence of causality markers.
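The trigger-based candidate selection and placeholder generalization described above can be sketched as follows. This is a minimal sketch under stated assumptions: the regular expression is a whitespace-aware adaptation of the example pattern in the text, not the original expression, and the small `KNOWN_ENTITIES` dictionary is a hypothetical stand-in for a real named entity tagger or chunker.

```python
import re

# Adapted from the example pattern in the text; whitespace handling and
# non-capturing groups are assumptions made so the pattern is usable as-is.
THREAT_PATTERN = re.compile(
    r"\b(?:may\s+)?poses?\s+(?:an?\s+)?threats?\s+to\b", re.IGNORECASE
)
# Trigger keywords from the text, matched as whole words.
TRIGGER_PATTERN = re.compile(r"\b(?:risk|threat)s?\b", re.IGNORECASE)

# Hypothetical stand-in for a named entity tagger: surface form -> placeholder.
KNOWN_ENTITIES = {"J.P. Morgan": "<COMPANY>", "Bolivia": "<LOCATION>"}


def is_candidate(sentence: str) -> bool:
    """True if the sentence matches a trigger keyword or the threat pattern."""
    return bool(THREAT_PATTERN.search(sentence) or TRIGGER_PATTERN.search(sentence))


def generalize(sentence: str) -> str:
    """Replace known entity mentions with per-category placeholders."""
    for surface, placeholder in KNOWN_ENTITIES.items():
        sentence = sentence.replace(surface, placeholder)
    return sentence


example = "Rising rates may pose a threat to J.P. Morgan."
if is_candidate(example):
    print(generalize(example))  # Rising rates may pose a threat to <COMPANY>.
```

The generalized sentence can then serve as a new pattern for reprocessing the corpus, with or without human review, as the paragraph describes.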
In one embodiment of the present invention, a variant of surrogate machine learning (a technique for machine learning of a task from examples) can be used to create training data for a machine learning-based classifier that extracts risk-indicating sentences. A useful technique is described by Sriharsha Veeramachaneni and Ravi Kumar Kondadadi in "Surrogate Learning - From Feature Independence to Semi-Supervised Classification", Proceedings of the NAACL HLT Workshop on Semi-supervised Learning for Natural Language Processing, pp. 10-18, Boulder, Colo., June 2009, Association for Computational Linguistics (ACL), the contents of which are incorporated herein by reference.
The risk type classifier classifies each risk pattern by risk type ("RT") according to a predefined taxonomy of risk types. In one embodiment of the invention, the taxonomy may use, without limitation, the following categories: Political: government policy, public opinion, changes in ideology, dogma, legislation, unrest (war, terrorism, riots); Environmental: contaminated land or pollution liability, nuisance (e.g., noise), permits, public opinion, internal/corporate policy, environmental laws or regulations or practices or "impact" requirements; Planning: permit requirements, policies and practices, land use, socioeconomic impact, public opinion; Market: demand (forecasts), competition, obsolescence, customer satisfaction, fashion; Economic: fiscal policy, taxation, cost inflation, interest rates, exchange rates; Financial: bankruptcy, margins, insurance, risk sharing; Natural: unforeseen ground conditions, weather, earthquakes, fire, explosions, archaeological discoveries; Project: definition, procurement strategy, performance requirements, standards, leadership, organization (maturity, commitment, competence, and experience), planning and quality control, procedures, labor and resources, communication and culture; Technical: design adequacy, operational efficiency, reliability; Regulatory: changes by the regulator; Human: error, incompetence, negligence, fatigue, communication skills, culture, working in the dark or at night; Criminal: lack of security, vandalism, theft, fraud, corruption; Safety: regulations, hazardous substances, collisions, collapse, floods, fire, explosions; and/or Legal: changes in legislation, treaties.
The risk clusterer groups all risks in the database by similarity, without imposing a predefined taxonomy (i.e., it is data-driven). In one embodiment, Hearst pattern induction can be used. Hearst pattern induction was first described in Hearst, Marti, "WordNet: An Electronic Lexical Database and Some of its Applications" (Christiane Fellbaum, MIT Press, 1998), the contents of which are incorporated herein by reference. In another embodiment of the present invention, a number k is chosen by the system developer, and a k-means clustering method can be used. Further details of such clustering are described in Hastie, Trevor, Robert Tibshirani, and Jerome Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" (2nd edition, Springer, 2009), the contents of which are incorporated herein by reference. In such a case, risks are grouped into a certain number (i.e., k) of categories, and are then classified by selecting the cluster with the highest similarity to the risk of interest. In another embodiment of the present invention, hierarchical clustering is used. Alternatively or additionally, both k-means clustering and hierarchical clustering may be used.
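The k-means option mentioned above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: risks are assumed to have already been converted into numeric feature vectors (the text does not say how), Euclidean distance is used as the dissimilarity measure, and the first k points are used as initial centroids purely to keep the example deterministic.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(points: Sequence[Sequence[float]], k: int,
           iterations: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain k-means: returns (cluster index per point, final centroids)."""
    # Deterministic initialization for illustration: the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical 2-D feature vectors for four risks.
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(vectors, k=2)
print(labels)  # [0, 0, 1, 1]
```

A new risk can then be classified by computing its distance to each returned centroid and choosing the closest one.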
In one embodiment of the risk clusterer according to the present invention, a text corpus is provided. The text corpus is tokenized into a set of sentences. All instances of risks, indicated by "*", are extracted from the tokenized text. The taxonomy of risks is constructed as a tree by organizing all fillers (i.e., "*") that match the risk pattern. Hearst pattern induction can be used to induce the risk taxonomy. In addition, an NP chunker can be used to find the boundaries of interest.
In another embodiment of the risk clusterer according to the present invention, a risk taxonomy is created from, for example, risks, legal risks, and legal changes. As indicated by , risks such as those that may be associated with legal changes are used as seeds. As indicated by , legal risks such as legal changes are mined by the computing device. As indicated by , risks are also mined for legal risks. In this manner, there is feedback to the legal risks based on the risks and the legal changes. Mining for risks and legal risks may include mining with the word "risk" or its equivalents. Mining for legal changes need not include the word "risk". Advantageously, the taxonomy resulting from this process contains risk-indicating phrases that do not necessarily contain the word "risk" itself. In addition to its use for risk type classification, such a taxonomy can also be used in risk mining patterns.
The risk alerter performs a similarity matching operation between the risks in the database and possible instances of P or Q in the text feed 110. If evidence for P is found, the risk P=>Q is "imminent". If evidence for Q is found, the risk P=>Q has materialized. In one embodiment of the present invention, the risk alerter delivers a warning notification directly to the user.
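The matching logic of the risk alerter can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not specify the similarity measure, so token-set Jaccard overlap is swapped in as a simple stand-in, and the `Risk` record, the function names, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Risk:
    """A mined risk P=>Q: precondition text P and high-impact event text Q."""
    precondition: str
    event: str


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets (stand-in similarity measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def check_feed_item(risk: Risk, item: str, threshold: float = 0.5) -> Optional[str]:
    """Classify one feed item against one risk.

    Evidence for Q means the risk has materialized; evidence for P means
    the risk is imminent; otherwise there is nothing to report.
    """
    if similarity(risk.event, item) >= threshold:
        return "materialized"
    if similarity(risk.precondition, item) >= threshold:
        return "imminent"
    return None


risk = Risk(precondition="lithium supply shortage", event="battery prices rise")
print(check_feed_item(risk, "lithium supply shortage reported"))  # imminent
print(check_feed_item(risk, "battery prices rise sharply"))       # materialized
```

Checking Q before P means an item that mentions both is reported as materialized, which seems the more urgent of the two statuses.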
Thus, on inspecting the risk database, a user (e.g., a risk analyst) can take action immediately before a risk materializes, and can prioritize the management of imminent risks ("P!,...,P!,P!,P!,...P!...") and of materialized risks ("Q!") in the text feed as events unfold, without even having to read the text feed.
In one embodiment of the present invention, the output of the risk alerter is connected to the input of a risk routing unit, which notifies an analyst whose profile matches the risk type RT. For example, an analyst may wish to know about environmental risks. When preconditions of a possible environmental event are mined, the risk alerter alerts the analyst to the environmental risk. For example, when industrial activity increases in a particular country or region, the analyst may be alerted to the environmental risk of global warming.
In one embodiment of the present invention, a set of risk descriptions extracted from a corpus defined as the set of all past Securities and Exchange Commission ("SEC") filings is matched against risks extracted from a text feed. To ensure compliance with SEC business risk disclosure obligations, the method proposes a risk description, or a ranked list of alternative risk descriptions, for inclusion in a draft SEC filing for the company operating the system.
The present invention may use a variety of methods for risk identification. For example, as depicted in FIG. 9, risk mining may include: baseline monitoring of regular expression patterns over surface strings and named entity tags; identifying words frequently associated with risk using clustering and information theory; and/or clustering of risk-indicating sentences. Alternatively or additionally, techniques for machine learning of a task from examples may be used. Risk identification includes querying one or more corpora for risk-indicating patterns. The query results may match all, substantially all, or some of the risk-indicating patterns. The number of occurrences of specific risk-indicating patterns can also be used in the risk mining techniques of the present invention.
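One information-theoretic way to find words frequently associated with risk is pointwise mutual information (PMI), which the text names elsewhere as a suitable statistical procedure. The following is a minimal sketch under stated assumptions: co-occurrence is counted at the sentence level, tokenization is plain whitespace splitting, and the function name `pmi_with_target` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def pmi_with_target(sentences: List[str], target: str = "risk") -> Dict[str, float]:
    """Sentence-level PMI between each co-occurring term and the target term.

    PMI(term, target) = log2( P(term, target) / (P(term) * P(target)) ),
    with probabilities estimated as fractions of sentences.
    """
    n = len(sentences)
    term_counts: Counter = Counter()
    joint_counts: Counter = Counter()
    target_count = 0
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        term_counts.update(tokens)
        if target in tokens:
            target_count += 1
            joint_counts.update(tokens - {target})
    scores = {}
    for term, joint in joint_counts.items():
        p_joint = joint / n
        p_term = term_counts[term] / n
        p_target = target_count / n
        scores[term] = math.log2(p_joint / (p_term * p_target))
    return scores


corpus = [
    "bankruptcy risk rises",
    "bankruptcy risk falls",
    "weather is nice",
    "weather risk rises",
]
scores = pmi_with_target(corpus)
# "bankruptcy" always co-occurs with "risk", so it scores higher than "weather".
print(scores["bankruptcy"] > scores["weather"])  # True
```

Terms with high positive PMI are candidates for the risk-associated term set; with a realistic corpus, rare terms would need a minimum-count filter because PMI overstates associations for low-frequency words.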
FIGS. 10 and 11 illustrate examples of risk mining according to the present invention. In Example 1 of FIG. 10, a corpus including the listed news article is mined for the term "cholesterol" as the precondition P of the event Q. The event Q is further classified by the holder "diabetics" and the target "amputation risk". The risk type RT is health, and the risk has positive polarity because it is beneficial to health. For the purposes of the present invention, the term "risk" refers not only to negative or harmful events but can also refer to positive or beneficial outcomes. In other words, a risk can have a positive and/or negative impact. In Example 2 of FIG. 11, a corpus including the listed news article is mined for the term "North Korea launch" as the precondition P of the event Q. The event Q is further classified by the holder "North Korea" and the target "more than condemnation" U.S. The risk type RT is political, and the risk has negative polarity because it is harmful to world politics. In addition, such negative and/or positive polarity can be weighted by the degree of risk. In such cases, it may be beneficial to alert the user 130 to very harmful or very beneficial risks to a greater degree than to risks with lesser consequences.
FIG. 12 illustrates another example of risk mining according to the present invention. In Example 3, a news article is mined. As background, demand for the metal lithium is increasing while only a limited supply is available. Much of the metal is obtained from Bolivia, whose government, at the time the article was published, might be regarded by some as unfriendly to capitalist governments or companies. As indicated by the underlined words and/or sequences, the article is mined for various candidate words, word sequences, and/or partial phrases in order to query it for prerequisites P of an event Q that may give rise to a risk. The risk types present in the article include supply-and-demand risk and political risk.
FIG. 13 illustrates another example of risk mining according to the present invention. In Example 4a, the corpus is mined for patterns with specific markers, namely "if" and "then". Mining extracts sequences that begin with or contain these markers. The length of a sequence is not limited to any particular length or number of words but is determined by the markers. The sequences are stored, for example, in a register of a computing device. However, the use of patterns such as, but not limited to, those shown in FIG. 16 can be more precise than keyword-based ranked retrieval.
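The marker-delimited mining of Example 4a can be sketched with a regular expression. This is a minimal illustration, not the method shown in FIG. 13: the function name `mine_if_then`, the choice to end the "then" clause at the next sentence punctuation, and the sample sentence are all assumptions.

```python
import re

# "if" opens the prerequisite P, "then" opens the event Q.
# Ending Q at the next . ; ! or ? is an assumption of this sketch.
IF_THEN = re.compile(
    r"\bif\b(?P<p>.+?)\bthen\b(?P<q>[^.;!?]+)",
    re.IGNORECASE | re.DOTALL,
)


def mine_if_then(text):
    """Return (prerequisite, event) pairs delimited by the markers 'if' and 'then'."""
    pairs = []
    for match in IF_THEN.finditer(text):
        # Collapse whitespace and trim stray commas around each clause.
        p = " ".join(match.group("p").split()).strip(" ,")
        q = " ".join(match.group("q").split()).strip(" ,")
        pairs.append((p, q))
    return pairs


sample = "If oil prices keep rising, then airline margins will shrink. Nothing else."
pairs = mine_if_then(sample)
```

The extracted span has no fixed length; it is bounded only by where the markers and the closing punctuation fall, which matches the statement that sequence length is determined by the markers.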
FIG. 14 illustrates another example of risk mining according to the present invention. In Example 5a, the corpus is mined according to the syntactic or grammatical structure of sentences or phrases. This example uses the standard Penn Treebank part-of-speech tags, or slightly modified Penn tags. Further details of the Penn Treebank can be found at http://www.cis.upenn.edu/~treebank/ (the Penn Treebank home page), the contents of which are incorporated herein by reference, or by contacting the Linguistic Data Consortium, University of Pennsylvania, 3600 Market Street, Suite 810, Philadelphia, Pa. 18104. Corresponding tag sets have been established for languages other than English and are known to those skilled in the art. In this example, the tag "PRP" denotes a personal pronoun, i.e., "we" in the example sentence. The tag "VBP" denotes a non-third-person singular present-tense verb, i.e., "expect" in the example sentence. The tag "TO" simply denotes the word "to" in the example sentence. The tag "VB" denotes a base-form verb, i.e., "be" in the example sentence. The tag "RB" denotes an adverb, i.e., "negatively" in the example sentence. The tag "IN" denotes a preposition or subordinating conjunction, i.e., "by" in the example sentence. Some common Penn Treebank part-of-speech (P.O.S.) tags include, but are not limited to: CC: coordinating conjunction; CD: cardinal number; DT: determiner; EX: existential "there"; FW: foreign word; IN: preposition or subordinating conjunction; JJ: adjective; JJR: adjective, comparative; JJS: adjective, superlative; LS: list item marker; MD: modal; NN: noun, singular or mass; NNS: noun, plural; NNP: proper noun, singular; NNPS: proper noun, plural; PDT: predeterminer; POS: possessive ending; PRP: personal pronoun; PRP$: possessive pronoun (Prolog version PRP-S); RB: adverb; RBR: adverb, comparative; RBS: adverb, superlative; RP: particle; SYM: symbol; TO: "to"; UH: interjection; VB: verb, base form; VBD: verb, past tense; VBG: verb, gerund or present participle; VBN: verb, past participle; VBP: verb, non-third-person singular present; VBZ: verb, third-person singular present; WDT: wh-determiner; WP: wh-pronoun; WP$: possessive wh-pronoun (Prolog version WP-S); and WRB: wh-adverb.
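Matching a sentence against a sequence of Penn Treebank tags can be sketched as a sliding-window comparison over tokens that have already been tagged. The sketch below assumes tagging has been done elsewhere. The function name `find_tag_pattern`, the `*` wildcard, and the tokens "affected" and "tariffs" with their tags are hypothetical; only the tags for "we", "expect", "to", "be", "negatively", and "by" come from the description of Example 5a.

```python
def find_tag_pattern(tagged, pattern):
    """Return token windows whose tags equal `pattern`; '*' matches any one tag."""
    width = len(pattern)
    hits = []
    for start in range(len(tagged) - width + 1):
        window = tagged[start:start + width]
        if all(p == "*" or p == tag for p, (_, tag) in zip(pattern, window)):
            hits.append([token for token, _ in window])
    return hits


# Tag sequence taken from the description, with one wildcard slot (an assumption)
# for the word that sits between the adverb and the preposition.
RISK_PATTERN = ["PRP", "VBP", "TO", "VB", "RB", "*", "IN"]

tagged_sentence = [
    ("we", "PRP"), ("expect", "VBP"), ("to", "TO"), ("be", "VB"),
    ("negatively", "RB"), ("affected", "VBN"),  # hypothetical token
    ("by", "IN"), ("tariffs", "NNS"),           # "tariffs" is hypothetical
]

hits = find_tag_pattern(tagged_sentence, RISK_PATTERN)
```

Because the pattern is expressed over tags rather than words, the same rule would also fire on sentences that use a different pronoun or verb in the same grammatical frame.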
In FIG. 15, Example 6 illustrates another mining sequence or algorithm based on Penn Treebank tags. Thus, as shown in FIGS. 14 and 15, the mining techniques of the present invention can analyze the same sentence under different criteria to obtain risks or prerequisites for risks.
In FIG. 16, risk mining according to the present invention is accomplished through sequences of binary grammatical dependency relations between words (including placeholders).
The examples and techniques described above for mining risks can be used individually or in any combination. The present invention is not, however, limited to these specific examples, and other patterns or techniques can be used with it. The patterns mined by these examples and/or by the techniques of the present invention can be ranked according to various ranking algorithms, such as, but not limited to, statistical language models (LMs), graph-based algorithms (such as PageRank or HITS), ranking SVMs, or other suitable methods.
In one aspect of the present invention, a computer-implemented method for mining risks is provided. The method includes: providing a set of risk-indicating patterns on a computing device; querying a corpus using the computing device to identify a set of potential risks by using a risk identification algorithm based at least in part on the set of risk-indicating patterns associated with the corpus; comparing the set of potential risks with the risk-indicating patterns to obtain a set of prerequisite risks; generating a signal representing the set of prerequisite risks; and storing the signal representing the set of prerequisite risks in an electronic memory. The method may also include: determining an impending risk from the prerequisite risks, the impending risk being determined using the risk identification algorithm and being associated with at least one risk from the set of prerequisite risks; generating a signal representing the impending risk; and storing the signal representing the impending risk in the electronic memory. The method may further include: determining a materialized risk after storing the signal representing the set of prerequisite risks, the materialized risk being determined using the risk identification algorithm and being associated with the set of risks; generating a signal representing the materialized risk; and storing the signal representing the materialized risk in the electronic memory. The method may still further include: determining a materialized risk after storing the signal representing the impending risk, the materialized risk being determined using the risk identification algorithm and being associated with the impending risk; generating a signal representing the materialized risk; and storing the signal representing the materialized risk in the electronic memory.
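The first steps of the method (provide patterns, query the corpus for potential risks, compare against the patterns to obtain prerequisite risks, store the result) can be sketched as follows. This is a toy outline under stated assumptions, not the claimed algorithm: the names `RiskStore`, `query_corpus`, and `select_prerequisites`, the sentence-splitting rule, the `min_patterns` threshold, and the two sample patterns and documents are all invented for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RiskStore:
    """Stand-in for the electronic memory; later passes could fill the other lists."""
    prerequisite: list = field(default_factory=list)
    impending: list = field(default_factory=list)
    materialized: list = field(default_factory=list)


def query_corpus(corpus, patterns):
    """Potential risks: each sentence paired with the patterns it matches."""
    potential = []
    for document in corpus:
        for sentence in re.split(r"(?<=[.!?])\s+", document):
            matched = [p.pattern for p in patterns if p.search(sentence)]
            if matched:
                potential.append((sentence, matched))
    return potential


def select_prerequisites(potential, min_patterns=1):
    """Keep potential risks that match at least `min_patterns` patterns (assumed rule)."""
    return [s for s, matched in potential if len(matched) >= min_patterns]


# Hypothetical pattern set and corpus.
patterns = [
    re.compile(r"\bif\b.+\bthen\b", re.IGNORECASE),
    re.compile(r"\bexpect to be negatively\b", re.IGNORECASE),
]
corpus = [
    "If supply tightens, then prices may spike. The weather was mild.",
    "We expect to be negatively affected by tariffs.",
]

store = RiskStore()
store.prerequisite = select_prerequisites(query_corpus(corpus, patterns))
```

Raising `min_patterns` would approximate requiring a match on more of the risk-indicating patterns before a sentence is kept as a prerequisite risk.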
Desirably, the corpus is digital. The corpus may include, but is not limited to: news; financial information, including but not limited to stock price data and its standard deviation (volatility); government and regulatory reports, including but not limited to government agency reports and regulatory filings such as tax filings, medical filings, legal filings, Food and Drug Administration (FDA) filings, and Securities and Exchange Commission (SEC) filings; private entity publications, including but not limited to annual reports, newsletters, advertisements, and press releases; blogs; web pages; event streams; protocol documents; status updates on social networking services; emails; Short Message Service (SMS) messages; instant chat messages; Twitter tweets; and/or combinations thereof.
The risk identification algorithm can be based on various factors and/or criteria. For example, the risk identification algorithm can be based on, but is not limited to: a set of terms statistically associated with risk; a temporal factor; a customized set of criteria; and combinations thereof. The customized set of criteria can, for example, include and/or take into account: industry criteria, geographic criteria, monetary criteria, political criteria, severity criteria, urgency criteria, subject-matter criteria, topic criteria, a set of named entities, and combinations thereof.
In one aspect of the present invention, the risk identification algorithm can be based on a set of source ratings. As used herein, the phrase "source rating" refers to a rating of a source with respect to, for example but not limited to, relevance, reliability, and the like. The set of source ratings can have a one-to-one correspondence with a set of sources. The set of sources can serve as the source of the information on which the corpus is based. The set of source ratings can be modified based on impending risks, materialized risks, and combinations thereof.
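One way to picture modifying source ratings based on materialized risks is a simple feedback update. The rule below is purely an assumption for illustration (the text does not specify any update rule): a source's rating rises by a fixed step when a risk it flagged materializes and falls by the same step otherwise, clamped to the range 0 to 1. The function name, the default rating of 0.5, and the sample data are hypothetical.

```python
def update_source_ratings(ratings, flagged_by, materialized, step=0.1):
    """Return new ratings; one rating per source (one-to-one correspondence).

    ratings:      dict mapping source name -> rating in [0, 1]
    flagged_by:   dict mapping risk id -> set of sources that reported it
    materialized: dict mapping risk id -> True if the risk materialized
    """
    updated = dict(ratings)
    for risk_id, sources in flagged_by.items():
        delta = step if materialized.get(risk_id, False) else -step
        for source in sources:
            current = updated.get(source, 0.5)  # assumed default for unseen sources
            updated[source] = min(1.0, max(0.0, current + delta))
    return updated


# Hypothetical data: "wire" flagged a risk that materialized, "blog" one that did not.
ratings = {"wire": 0.5, "blog": 0.5}
new_ratings = update_source_ratings(
    ratings,
    flagged_by={"r1": {"wire"}, "r2": {"blog"}},
    materialized={"r1": True, "r2": False},
)
```

The function returns a new dictionary rather than mutating its input, so earlier rating snapshots remain available for comparison.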
The method of the present invention may further include: transmitting the signal representing the set of prerequisite risks, transmitting the signal representing the impending risk, transmitting the signal representing the materialized risk, and combinations thereof. In addition, the present invention may include providing a web-based risk alert service using at least one of the following: the signal representing the set of risks, the signal representing the impending risk, the signal representing the materialized risk, and combinations thereof.
In another aspect of the present invention, a computing device may include: an electronic memory; and a risk identification algorithm based at least in part on a set of risk-indicating patterns associated with a corpus stored in the electronic memory. A processor (not shown) may be used to run the algorithm on the computing device. The computing device may include a computer interface, depicted as (but not limited to) a keyboard, for querying the risk identification algorithm. The computing device may also include a display for receiving signals from the electronic memory and for displaying risk alerts from the risk identification algorithm.
In another aspect of the present invention, a computer system for alerting a user to a risk is provided. The system may include a computing device having an electronic memory and a risk identification algorithm, the risk identification algorithm being based at least in part on a set of risk-indicating patterns associated with a corpus stored in the electronic memory. A processor may be used to run the algorithm on the computing device. The system may also include a user interface for querying the risk identification algorithm and for receiving, from the electronic memory of the computing device, a signal for alerting the user to the risk. The user interface may include, but is not limited to, a computer, a television, a portable media device, and/or a web-enabled device such as a cellular phone, a personal digital assistant, or the like.
In implementation, the concepts of the present invention may be carried out automatically or semi-automatically, i.e., with some degree of human intervention. Likewise, the present invention is not limited in scope by the specific embodiments described herein. It is fully contemplated that various other embodiments of and modifications to the present invention, in addition to those described herein, will become apparent to those skilled in the art from the foregoing description and accompanying drawings. Such other embodiments and modifications are therefore intended to fall within the scope of the appended claims. Furthermore, although the present invention has been described herein in the context of particular embodiments, implementations, and applications and in particular environments, those skilled in the art will recognize that its usefulness is not limited thereto and that it can be beneficially applied in any number of ways and environments for any number of purposes. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the present invention as disclosed herein.
Claims (46)
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/337,703 US20120316916A1 (en) | 2009-12-01 | 2011-12-27 | Methods and systems for generating corporate green score using social media sourced data and sentiment analysis |
| US13/337703 | 2011-12-27 | ||
| PCT/US2012/071626 WO2013101812A1 (en) | 2011-12-27 | 2012-12-26 | Methods and systems for generating corporate green score using social media sourced data and sentiment analysis |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1203678A1 (en) | 2015-10-30 |
| HK1203678B (en) | 2020-02-07 |