WO2024229036A1 - Systems and methods for exploring quantifiable trends in line charts - Google Patents
Systems and methods for exploring quantifiable trends in line charts
- Publication number
- WO2024229036A1 (PCT/US2024/027076)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- labeled
- trend
- events
- event
- search
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
- G06F16/287—Visualization; Browsing
Definitions
- the disclosed implementations relate generally to data visualization and more specifically to systems, methods, and user interfaces that enable users to explore quantifiable trends in time series data.
- Natural language and search interfaces facilitate data exploration and provide visualization responses to analytical queries based on underlying datasets.
- Existing search tools limit their support for analytical intents to document search, fact-finding, or simple retrieval of data values, and have limited support for more specific analytic tasks such as the identification of precise temporal trends in time-series data.
- Trend analysis is an important aspect of the data analysis and decision-making process.
- Trends are data patterns that indicate a general change in data attributes (e.g., data fields, or data values of a data field) over time.
- the identification of data trends can in turn lead to the recognition of anomalies or deviations from normal or expected values of a dataset, due to factors such as significant events, seasonality, and market conditions.
- Visual data analysis tools often visualize trends as line charts. These tools can also provide additional computation functionality such as moving averages, trend lines, or regression analysis to indicate how the data changes over time.
- Search interfaces can facilitate data exploration and provide visualization responses to analytical queries based on underlying datasets.
- search engines can provide data relevant to the user’s query in the form of visualizations and/or widgets.
- Natural language interfaces for visual data analysis and large language models (LLMs) make it easy and convenient for a user to interact with a device and query data through the translation of user intent into device commands.
- search tools can only support basic analytical tasks such as document search, fact-finding, or simple retrieval of data values. These tools have limited support for more specific analytic tasks, such as computing derived values, finding correlations between variables, creating clusters of data points, and identifying temporal trends. NLI-related tools that are currently available tend to focus on the general support of analytical inquiry and do not consider the interpretation of intents specific to data trends.
- the human language (e.g., natural language) is remarkably diverse when it comes to describing data trends.
- Expressions such as “slow increase,” “steady increase,” “exploding,” “slumping,” and “tanking” convey different extents (e.g., relative magnitudes or degrees) of changes in data values and are likely to invoke different user responses.
- the expressive power in these scenarios comes from the precise, quantified semantics of the words used to describe the trends, which existing NLI systems are unable to capture or leverage.
- Some implementations of the present disclosure describe a system and user interface that enables users to search for phenomena in a dataset by leveraging precise, quantified semantics of language, focusing on searching for trends in time series data.
- Some implementations of the present disclosure describe generating a labeled dataset of semantic concepts describing trends and their quantifiable properties, collected through crowdsourced data collection experiments.
- the disclosed dataset maps numeric slopes to semantic trend descriptor words.
- the dataset includes slope labels (e.g., “falling”) and slope labels with modifiers (“slowly falling”), along with multi-line trends that comprise a combination of “up”, “down”, and “flat” trend segments (e.g., “peak” and “valley”).
- the dataset provides useful metadata that enable a structured approach to indexing, classification, and retrieval of trends in a search system. Metadata that can encapsulate language describing slopes and angles can further enhance the precision and recall of search interfaces. Based on this dataset, an approach for applying semantic trend descriptor labels to raw time series data is introduced.
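- As an illustration of applying trend descriptor labels to raw segments, the following minimal sketch maps a segment's slope angle to a descriptor. The angle bands, modifier words, and the function name `label_segment` are hypothetical placeholders chosen for illustration; they are not values from the crowdsourced dataset, and the coordinates are assumed to be normalized to the chart's axes.

```python
import math

# Hypothetical angle bands in degrees (upper bound, modifier). These are
# illustrative placeholders, not values taken from the disclosed dataset.
ILLUSTRATIVE_BANDS = [
    (5.0, "flat"),
    (30.0, "slowly"),
    (60.0, "steadily"),
    (90.0, "sharply"),
]


def label_segment(x0, y0, x1, y1):
    """Return a descriptor for a segment given in normalized chart coordinates."""
    if x1 <= x0:
        raise ValueError("segment must advance in time (x1 > x0)")
    # Signed slope angle; strictly between -90 and 90 degrees because x1 > x0.
    angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
    magnitude = abs(angle)
    modifier = ILLUSTRATIVE_BANDS[-1][1]
    for upper, name in ILLUSTRATIVE_BANDS:
        if magnitude <= upper:
            modifier = name
            break
    if modifier == "flat":
        return "flat"
    direction = "rising" if angle > 0 else "falling"
    return f"{modifier} {direction}"
```

- A labeling pass under these assumptions would call `label_segment` once per detected segment and store the returned descriptor alongside the segment's start and end points.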
- Some implementations of the present disclosure describe a system and interface that leverage a quantified semantics dataset and labeling algorithms to produce a novel analytical search experience that supports diverse trend search intents and facilitates the retrieval and visualization of temporal data patterns.
- the disclosed system and interface, also known as “SlopeSeeker,” incorporates custom logic for scoring and ranking results based on both the label relevance and visual prominence of trends.
- SlopeSeeker surfaces a semantic hierarchy of trend descriptor terms from the dataset, with which the user can interact to filter results down to only those that the user deems most relevant.
- the present disclosure extends the capabilities of general searching to support intents that involve trends and their properties in line charts. For instance, SlopeSeeker detects analytical trend intents in the search queries and finds trends matching the specified quantifiable properties such as “sharp decline” and “gradual rise” in line charts. By leveraging quantified semantics of language, the present disclosure uniquely explores the nuances of trend patterns and their properties using natural language as the modality for expressing trend search queries.
- SlopeSeeker also integrates text with the search results, along with faceted browsing, to provide additional information and expressivity for navigating the search results.
- the present disclosure also builds upon search and natural language interface data analysis systems to support the exploration of trends with a comprehensive labeled semantic concept map of trends and their properties.
- a method for analyzing data trends is performed at a computing device having one or more processors, and memory storing one or more programs configured for execution by the one or more processors.
- the method includes receiving a first natural language input specifying one or more search terms directed to a dataset.
- the dataset comprises a set of time series data.
- the method includes, in response to receiving the first natural language input, parsing the first natural language input into one or more tokens.
- the method includes assigning a respective semantic role to each of the one or more tokens.
- the method includes translating (i) the one or more tokens and (ii) one or more semantic roles assigned to the one or more tokens into one or more queries.
- the method includes executing the one or more queries against a search index to retrieve a plurality of labeled trend events.
- Each labeled trend event (i) corresponds to a respective portion of a respective line chart of a set of line charts representing the time series data and (ii) has a respective chart identifier.
- the method includes determining, for each labeled trend event, a respective composite score.
- the method includes individually assigning each of the plurality of labeled trend events to a respective group according to the respective chart identifier, where each group (i) includes one or more respective labeled trend events and (ii) corresponds to one respective line chart in the set of line charts.
- the method includes sorting, for each group of the one or more groups, the one or more respective labeled trend events within the respective group according to respective composite scores corresponding to the one or more respective labeled trend events.
- the method includes determining, for each group of the one or more groups, a respective final score.
- the method includes ranking the one or more groups according to one or more determined final scores.
- the method includes retrieving, from the dataset, data corresponding to a first subset of line charts having the respective chart identifiers of the ranked one or more groups in accordance with the ranking.
- the method includes generating the first subset of line charts.
- the method includes annotating respective segments of the first subset of line charts that correspond to the labeled trend events.
- the method also includes displaying one or more line charts of the first subset of line charts as annotated.
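- The grouping, sorting, and ranking steps above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the tuple layout and the function name `rank_charts` are hypothetical, and summing composite scores to obtain a group's final score is an assumption, since the text does not fix the aggregation at this point.

```python
from collections import defaultdict


def rank_charts(scored_events):
    """Group scored events by chart, sort within groups, and rank the groups.

    scored_events: iterable of (chart_id, event_label, composite_score).
    Returns a list of (chart_id, final_score, events) ordered best first.
    """
    groups = defaultdict(list)
    for chart_id, label, score in scored_events:
        groups[chart_id].append((label, score))

    ranked = []
    for chart_id, events in groups.items():
        # Sort the events inside each group by composite score, best first.
        events.sort(key=lambda e: e[1], reverse=True)
        # Assumption: the final score of a group is the sum of its event scores.
        final_score = sum(score for _, score in events)
        ranked.append((chart_id, final_score, events))

    # Rank the groups (one per chart) by final score, best first.
    ranked.sort(key=lambda g: g[1], reverse=True)
    return ranked
```

- The chart identifiers in the returned order would then drive data retrieval, chart generation, and annotation of the matching segments.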
- the method further includes, after retrieving the data corresponding to the first subset of line charts, generating, for each line chart in the first subset of line charts, a respective text snippet describing a predefined number of events that match the one or more search terms, including annotating respective words in the respective text snippet that match the one or more search terms.
- Displaying the one or more of the first subset of line charts as annotated includes displaying the respective text snippet with each line chart in the one or more line charts.
- the method further includes displaying the annotated respective words with a different visual characteristic from other words in the respective text snippet and displaying the annotated respective segments with a different visual characteristic from other segments of the one or more line charts.
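- A text snippet of this kind can be sketched as below. The function name `build_snippet`, the tuple layout, and the use of square brackets to stand in for a distinct visual characteristic are all hypothetical choices made for illustration; the default of three events is likewise only one possible predefined number.

```python
def build_snippet(events, search_terms, max_events=3):
    """Describe the top-scoring events and mark words matching the search terms.

    events: list of (label, start, end, composite_score).
    Matching words are wrapped in square brackets as a stand-in for emphasis.
    """
    terms = {t.lower() for t in search_terms}
    # Keep only the predefined number of highest-scoring events.
    top = sorted(events, key=lambda e: e[3], reverse=True)[:max_events]
    parts = []
    for label, start, end, _score in top:
        words = [f"[{w}]" if w.lower() in terms else w for w in label.split()]
        parts.append(f"{' '.join(words)} from {start} to {end}")
    return "; ".join(parts)
```

- A renderer would replace the bracket markers with whatever styling the interface uses for matched words.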
- the respective semantic role for each token comprises a predefined category of a plurality of categories.
- the plurality of categories includes two or more of: an event type, a trend, an attribute, and a date range.
- the plurality of categories includes the event type and the event type is either a single event or a multi-sequence event.
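- Role assignment over the categories named above can be illustrated with a small lexicon-based tagger. Everything specific in this sketch is an assumption: the vocabularies are hypothetical, "attribute" is interpreted here as a data field name, a four-digit year stands in for a date range, and the presence of a sequencing word such as "then" is used to infer a multi-sequence event type.

```python
import re

# Hypothetical vocabularies for illustration only.
TREND_WORDS = {
    "sharp", "gradual", "slow", "steady",
    "rise", "increase", "decline", "fall", "peak", "valley",
}
ATTRIBUTE_WORDS = {"price", "volume"}  # assumed to name data fields
SEQUENCE_WORDS = {"then", "followed"}
YEAR = re.compile(r"^(19|20)\d{2}$")


def assign_roles(query):
    """Tokenize a query and assign one semantic role per token."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    roles = []
    for tok in tokens:
        if tok in TREND_WORDS:
            role = "trend"
        elif tok in ATTRIBUTE_WORDS:
            role = "attribute"
        elif YEAR.match(tok):
            role = "date_range"
        elif tok in SEQUENCE_WORDS:
            role = "sequence_marker"
        else:
            role = "other"
        roles.append((tok, role))
    # Assumption: a sequencing word signals a multi-sequence event type.
    is_sequence = any(role == "sequence_marker" for _, role in roles)
    event_type = "multi-sequence" if is_sequence else "single"
    return roles, event_type
```

- The (token, role) pairs and the inferred event type are the inputs that a later step would translate into search index queries.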
- each labeled trend event of the plurality of labeled trend events is identified by a respective chart ID, a respective start point, a respective end point, and a respective set of semantic labels.
- each labeled trend event of the plurality of labeled trend events is a respective labeled slope segment of a respective line chart in the set of line charts.
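- One way to represent such a record is sketched below. The class name, field names, and the choice of calendar dates for the start and end points are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class LabeledTrendEvent:
    """A labeled slope segment of one line chart (hypothetical field names)."""

    chart_id: str
    start: date
    end: date
    labels: FrozenSet[str]

    def __post_init__(self):
        # A segment must cover a positive span of time.
        if self.end <= self.start:
            raise ValueError("end must be after start")

    def duration_days(self):
        """Length of the segment in days."""
        return (self.end - self.start).days
```

- Making the record immutable and hashable lets events be used as dictionary keys or set members when grouping them by chart.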
- the respective composite score for each labeled trend event is computed based on (1) a respective label score representing an extent to which the one or more search terms match respective labels of the plurality of labeled trend events and (2) a respective visual saliency score.
- the respective composite score is a product of the respective label score and the respective visual saliency score.
- determining, for each labeled trend event, the respective composite score includes computing the respective label score according to (i) a frequency with which the search terms occur in the respective labeled trend event and (ii) a label length of the respective labeled trend event.
- each line chart in the set of line charts is a plot of data values of a data field over a predefined time span.
- Determining, for each labeled trend event, the respective composite score includes computing the respective visual saliency score according to (i) the temporal duration of the respective portion of the respective line chart relative to the predefined time span and (ii) a first difference in the data values of the data field over the temporal duration relative to a second difference in the data values of the data field over the predefined time span.
- each line chart in the set of line charts has the same time span.
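- The composite score described above can be sketched as follows. Only the product of label score and visual saliency score is stated in the text; the particular formulas used here for each factor (matching tokens divided by label length, and duration ratio multiplied by value-change ratio) are assumptions, as are the function names.

```python
def label_score(search_terms, label):
    """Fraction of label tokens that match a search term (assumed formula)."""
    label_tokens = label.lower().split()
    if not label_tokens:
        return 0.0
    terms = {t.lower() for t in search_terms}
    hits = sum(1 for tok in label_tokens if tok in terms)
    return hits / len(label_tokens)


def visual_saliency(event_duration, chart_span, event_delta, chart_delta):
    """Relative duration times relative value change (assumed formula)."""
    if chart_span <= 0 or chart_delta == 0:
        return 0.0
    duration_ratio = event_duration / chart_span
    change_ratio = abs(event_delta) / abs(chart_delta)
    return duration_ratio * change_ratio


def composite_score(search_terms, label, event_duration, chart_span,
                    event_delta, chart_delta):
    """Product of label score and visual saliency score, as stated in the text."""
    return label_score(search_terms, label) * visual_saliency(
        event_duration, chart_span, event_delta, chart_delta
    )
```

- Under these assumptions, a long segment with a large change and a label made up entirely of query terms scores highest, while a short or shallow segment is penalized even when its label matches exactly.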
- the search index stores (i) first vector representations corresponding to the plurality of labeled trend events, (ii) second vector representations corresponding to a plurality of encoded tokens, and (iii) respective mapping relationships between the first vector representations and the second vector representations.
- the retrieved plurality of labeled trend events includes a first labeled trend event corresponding to an exact match of the one or more tokens and a second labeled trend event corresponding to an inexact match of the one or more tokens.
- the method further includes: when no exact match exists between the retrieved plurality of labeled trend events and the one or more tokens: (i) generating and displaying a notification indicating that there is no exact match for the one or more terms and (ii) displaying one or more user-selectable text labels corresponding to synonyms of the one or more terms.
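- The no-exact-match behavior can be illustrated with the sketch below. The synonym table, the function name `search_with_fallback`, and the shape of the returned dictionary are hypothetical; a real system would draw synonyms from its own synonym file.

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "tanking": ["plummeting", "sharply falling"],
    "exploding": ["sharply rising"],
}


def search_with_fallback(term, indexed_labels):
    """Return exact matches, or a notice plus selectable synonym suggestions."""
    exact = [label for label in indexed_labels if label == term]
    if exact:
        return {"matches": exact, "notice": None, "suggestions": []}
    # Only suggest synonyms that actually exist as labels in the index.
    suggestions = [s for s in SYNONYMS.get(term, []) if s in indexed_labels]
    return {
        "matches": [],
        "notice": f'No exact match for "{term}".',
        "suggestions": suggestions,
    }
```

- The suggestions would be rendered as user-selectable text labels, so that choosing one reruns the search with a term the index is known to contain.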
- a method for analyzing data trends is performed at a computing device having one or more processors, and memory storing one or more programs configured for execution by the one or more processors.
- the method includes receiving a natural language input specifying a plurality of search terms directed to a dataset.
- the plurality of search terms includes a first search term and a second search term.
- the second search term is subsequent to the first search term in the natural language input.
- the dataset comprises a set of time series data.
- the method includes, in response to receiving the natural language input: when the first search term and the second search term specify a first sequence of data trends: (i) for the first search term, executing one or more first queries against a search index to retrieve a first set of labeled trend events and (ii) for the second search term, executing one or more second queries against the search index to retrieve a second set of labeled trend events.
- Each labeled trend event in the first and second sets of labeled trend events (i) corresponds to a respective portion of a respective line chart of a set of line charts representing the time series data and (ii) has a respective chart identifier.
- the method includes constructing one or more sequences of labeled trend events based on the retrieved first and second sets of labeled trend events.
- the method includes assigning each sequence of labeled trend events, of the one or more sequences of labeled trend events, into one or more groups according to the respective chart identifier.
- the method includes determining, for each group of the one or more groups, a respective final score.
- the method includes ranking the one or more groups according to one or more determined final scores.
- the method includes retrieving, from the dataset, data corresponding to a subset of line charts having the respective chart identifiers of the ranked one or more groups in accordance with the ranking.
- the method includes generating the subset of line charts.
- the method includes annotating respective segments of the subset of line charts that correspond to the sequences of labeled trend events.
- the method also includes displaying one or more line charts of the subset of line charts as annotated.
- constructing the one or more sequences of labeled trend events based on the retrieved first and second sets of trend events includes, for each sequence of labeled trend events: joining a respective first labeled trend event corresponding to the first search term and a respective second labeled trend event corresponding to the second search term, according to (i) a respective chart identifier corresponding to the respective first labeled trend event and the respective second labeled trend event and (ii) respective start and end dates of the respective first labeled trend event and the respective second labeled trend event.
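- The join described above can be sketched as follows. The dictionary keys and the function name `build_sequences` are hypothetical, and the adjacency rule (the first event ends exactly where the second begins) is an assumption; the text only says that the join uses the chart identifier and the start and end dates.

```python
from datetime import date


def build_sequences(first_events, second_events):
    """Join events for two search terms into two-step sequences.

    Each event is a dict with "chart_id", "start", and "end" keys.
    """
    # Index the second set by chart so each lookup stays within one chart.
    by_chart = {}
    for ev in second_events:
        by_chart.setdefault(ev["chart_id"], []).append(ev)

    sequences = []
    for first in first_events:
        for second in by_chart.get(first["chart_id"], []):
            # Assumption: the second event must start where the first ends.
            if first["end"] == second["start"]:
                sequences.append((first, second))
    return sequences
```

- A query such as "sharp decline then gradual rise" would pass the events retrieved for the first term as `first_events` and those for the second term as `second_events`.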
- the method includes, after constructing the one or more sequences of labeled trend events, determining, for each sequence of the one or more sequences, a respective sequence score by aggregating one or more respective composite scores corresponding to one or more respective labeled trend events in the respective sequence.
- the respective final score for each group of the one or more groups is an aggregation of one or more respective sequence scores, from one or more respective sequences of labeled trend events, in the respective group.
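- The two levels of aggregation can be sketched as below. Using a sum at both levels is an assumption, since the text says only that the scores are aggregated; the function names and input layout are hypothetical.

```python
def sequence_score(composite_scores):
    """Aggregate the composite scores of the events in one sequence (sum assumed)."""
    return sum(composite_scores)


def final_scores(sequences):
    """Aggregate sequence scores per chart and rank the charts.

    sequences: list of (chart_id, [composite scores of the sequence's events]).
    Returns a list of (chart_id, final_score) ordered best first.
    """
    totals = {}
    for chart_id, scores in sequences:
        totals[chart_id] = totals.get(chart_id, 0.0) + sequence_score(scores)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

- With a sum, a chart containing several matching sequences outranks a chart with a single sequence of similar quality; a maximum or mean would change that behavior.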
- the method further includes, for a respective labeled trend event in the respective sequence, determining a respective composite score for the respective labeled trend event based on (1) a respective label score representing an extent to which a respective search term matches respective labels of a respective set of labeled trend events and (2) a respective visual saliency score.
- the respective composite score is a product of the respective label score and the respective visual saliency score.
- determining the respective composite score includes computing the respective label score according to (i) a frequency with which the search terms occur in the respective labeled trend event and (ii) a label length of the respective labeled trend event.
- each line chart in the set of time series line charts is a plot of data values of a data field over a predefined timespan.
- Determining the respective composite score includes computing the respective visual saliency score according to (i) the temporal duration of the respective portion of the respective line chart relative to the predefined timespan and (ii) a first difference in the data values of the data field over the temporal duration relative to a second difference in the data values of the data field over the predefined timespan.
- the plurality of search terms specified in the natural language input includes a third search term.
- the first sequence of data trends is specified by the first search term, the second search term, and the third search term.
- the method further includes, when the constructed one or more sequences of labeled trend events are partial sequence matches of the natural language input, determining, for each sequence of the one or more sequences, a respective sequence score based at least in part on (i) the number of events in the respective sequence and (ii) the respective sequence offset.
- the respective final score for each group of the one or more groups is an aggregate of one or more respective sequence scores, from one or more respective sequences of labeled trend events, in the respective group.
- the determination that the first search term and the second search term specify the first sequence of data trends includes: parsing the natural language input that includes the first search term and the second search term into a plurality of tokens, including assigning (i) a first semantic role to a first token corresponding to the first search term and (ii) a second semantic role to a second token corresponding to the second search term, and determining, based on the assigned first and second semantic roles, that the first search term and the second search term specify the first sequence of data trends.
- parsing the natural language input includes determining that an event type corresponding to the natural language input is a multi-sequence event type.
- each labeled trend event of the first and second sets of labeled trend events is a respective labeled slope segment of a respective line chart in the set of line charts.
- each line chart in the set of line charts has the same time span.
- a computing device includes a display, one or more processors, and memory coupled to the one or more processors.
- the memory stores one or more programs configured for execution by the one or more processors.
- the one or more programs include instructions for performing any of the methods disclosed herein.
- a non-transitory computer readable storage medium stores one or more programs configured for execution by a computing device having a display, one or more processors, and memory.
- the one or more programs include instructions for performing any of the methods disclosed herein.
- Figure 1 illustrates an exemplary process for exploring data trends in time series data, in accordance with some implementations.
- Figure 2A provides a block diagram of a computing device, in accordance with some implementations.
- Figure 2B illustrates a labeled trend event, in accordance with some implementations.
- Figure 3 provides a block diagram of a data visualization server, in accordance with some implementations.
- Figure 4 illustrates an annotation-collection tool interface 400 for performing the crowdsourced study, in accordance with some implementations.
- Figure 5 illustrates various inter-word relationships and an average slope of annotations determined from a crowdsourced study, in accordance with some implementations.
- Figure 6 illustrates automatic labeling of visual features in a line chart, in accordance with some implementations.
- Figures 7A - 7C show the results of three crowdsourced experiments that were designed to collect a dataset of quantified semantics for trend descriptor words.
- Figure 8 illustrates a scatter plot showing implicit semantic hierarchies, in accordance with some implementations.
- Figure 9 illustrates examples of segment labeling with single labels, in accordance with some implementations.
- Figure 10 illustrates examples of segment labeling with compound labels, in accordance with some implementations.
- Figure 11 illustrates examples of shape labeling, in accordance with some implementations.
- Figure 12 shows a line chart that is output by the SlopeSeeker system in response to a user query that includes a superlative descriptor, in accordance with some implementations.
- Figure 13 shows a line chart that is output by the SlopeSeeker system in response to a user query that includes the terms “gradually increasing,” in accordance with some implementations.
- Figure 14A illustrates the SlopeSeeker system architecture, in accordance with some implementations.
- Figure 14B illustrates a user interface for the SlopeSeeker system, in accordance with some implementations.
- Figure 15 shows a code snippet for a search index configuration, in accordance with some implementations.
- Figure 16 illustrates a code snippet for defining the properties of fields within an index mapping in the search index, in accordance with some implementations.
- Figures 17A and 17B illustrate contents of a synonym file, in accordance with some implementations.
- Figures 18A - 18G provide a series of screenshots illustrating how the SlopeSeeker system allows a user to search for specific trends and data based on the quantified language of natural language queries, in accordance with some implementations.
- Figures 19A - 19D are a series of screenshots illustrating building sequence queries, in accordance with some implementations.
- Figures 20A and 20B are screenshots illustrating the use of the SlopeSeeker system to search for more global trends that may not correspond to a single slope within a segment of a trend, in accordance with some implementations.
- Figures 21A - 21I are a series of screenshots illustrating user interactions with the SlopeSeeker user interface, in accordance with some implementations.
- Figures 22A - 22D provide a flowchart of a method for analyzing data trends, in accordance with some implementations.
- Figures 23A - 23E provide a flowchart of a method for analyzing data trends, in accordance with some implementations.
- Some implementations of the present disclosure are directed to systems, methods, and user interfaces that enable users to search for and glean patterns in trends.
- the present disclosure extends the capabilities of general search to support intents that involve trends and their properties in line charts.
- the disclosed system, also referred to herein as SlopeSeeker, detects analytical trend intents in the search queries and finds trends matching the specified quantifiable properties such as “sharp decline” and “gradual rise” in line charts.
- Some implementations leverage quantified semantics of language to explore the nuances of trend patterns and their properties using natural language as the modality for expressing trend search queries.
- SlopeSeeker integrates text with the search results, along with faceted browsing, to provide additional information and expressivity for navigating the search results.
- the disclosed system also builds upon search and natural language interface data analysis systems to support the exploration of trends with a comprehensive labeled semantic concept map of trends and their properties.
- Figure 1 illustrates an exemplary process 100 for exploring data trends in time series data, in accordance with some implementations.
- the process 100 is performed by a computing device 200 executing an application that includes a user interface.
- the process 100 is performed by a server system 300 executing a web application that includes a user interface module.
- the process 100 includes receiving (112) a natural language input 104 directed to a dataset (e.g., datasets / data sources 140) that includes time series data.
- the natural language input 104 is received as text input via a search bar 102 (e.g., natural language input box) of a user interface 110.
- the natural language input is a voice command or any other type of user input.
- the process 100 includes executing (114) queries against a search index 130 to retrieve labeled trend events.
- the process 100 includes determining (116) a respective composite score for each labeled trend event.
- the process 100 includes grouping (118) the labeled trend events into “buckets” based on respective chart identifiers.
- the process 100 includes determining (120) a respective total score for each “bucket,” and displaying (122), via the user interface 110, one or more ranked charts 124 (e.g., ranked according to the total score).
- a displayed chart 124 includes a chart identifier 126.
- the displayed chart 124 includes one or more segments 127 corresponding to segments of the line chart whose time periods have the highest scoring matches.
- the annotated segments 127 are also visually emphasized (e.g., in a different color, or line thickness, or other visual emphasis) compared to other portions of the line chart.
- the chart 124 is displayed with an accompanying text snippet 128 describing up to three highest matches for the line chart.
- the emphasized chart segments and corresponding text snippets are interactively and bi-directionally linked. Hovering over a chart segment fades out any other emphasized segments and highlights the corresponding text in gray; hovering over a text snippet works similarly.
- FIG. 2A is a block diagram of a computing device 200, in accordance with some implementations.
- examples of the computing device 200 include a desktop computer, a laptop computer, a tablet computer, and other computing devices that have a display and a processor capable of running an application 230.
- the computing device 200 typically includes one or more processing units (processors or cores) 202, one or more network or other communication interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components.
- the communication buses 208 include circuitry (sometimes called a chipset) that interconnects and controls communications between system components.
- the computing device 200 includes a user interface 210.
- the user interface 210 typically includes a display device 212.
- the computing device 200 includes input devices such as a keyboard, mouse, and/or other input buttons 216.
- the display device 212 includes a touch-sensitive surface 214, in which case the display device 212 is a touch-sensitive display.
- the touch-sensitive surface 214 is configured to detect various swipe gestures (e.g., continuous gestures in vertical and/or horizontal directions) and/or other gestures (e.g., single/double tap).
- a physical keyboard is optional (e.g., a soft keyboard may be displayed when keyboard entry is needed).
- the user interface 210 also includes an audio output device 218, such as speakers or an audio output connection connected to speakers, earphones, or headphones.
- some computing devices 200 use a microphone and voice recognition to supplement or replace the keyboard.
- the computing device 200 includes an audio input device 220 (e.g., a microphone) to capture audio (e.g., speech from a user).
- the memory 206 includes high-speed random-access memory, such as DRAM, SRAM, DDR RAM, or other random-access solid-state memory devices.
- the memory 206 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
- the memory 206 includes one or more storage devices remotely located from the processors 202.
- the memory 206, or alternatively the non-volatile memory devices within the memory 206, includes a non-transitory computer-readable storage medium.
- the memory 206, or the computer-readable storage medium of the memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
- an operating system 222 which includes procedures for handling various basic system services and for performing hardware dependent tasks;
- a communications module 224 which is used for connecting the computing device 200 to other computers (e.g., server 300) and devices via the one or more communication interfaces 204 (wired or wireless), such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
- a web browser 226 (or other application capable of displaying web pages), which enables a user to communicate over a network with remote computers or devices;
- an audio input module 228 (e.g., a microphone module). The captured audio may be sent to a remote server (e.g., a server system 300) and/or processed by an application executing on the computing device 200 (e.g., the application 230 or the natural language processor 234);
- the application 230 includes:
  o a graphical user interface 110 (e.g., the SlopeSeeker user interface as illustrated in Figures 14B, 18A - 18G, 19A - 19D, 20A, 20B, and 21A - 21I) for a user to input natural language queries and display data visualizations (e.g., charts and line plots) and text snippets in response to the natural language queries;
  o an interface manager 232 (see Figure 14A), which receives natural language queries (e.g., via the SlopeSeeker user interface 110), passes the queries to a parser 236 (e.g., a semantic parser) (and/or a natural language processor 234) for processing, receives relevant documents from a search index 130, and generates outputs that include charts, accompanying annotations, and/or text snippets;
  o a natural language processor 234, which processes natural language queries;
  o a parser 236 (e.g., a semantic parser)
- the ranking module 240 ranks the results based on how precisely the search term (e.g., natural language input query) matches the event labels of the document. In some implementations, the ranking module 240 ranks the results based on a visual saliency score of the labeled event.
- the search index 130 includes: o configuration settings 242 (e.g., configuration specifications), which define the requirements for analysis and indexing; o an analysis module 244, which processes query string tokens and retrieves the most relevant labeled events based on the degree of overlap between the set of query string tokens and the document string tokens; and o a ranking module 246, which ranks labeled trend event results returned from the search index 130.
- the ranking module 246 ranks the labeled trend event results by computing a respective label score and a respective visual saliency score for each labeled trend event that was returned from the search index 130;
- the datasets / data sources 140 include time series data 250.
- An example of time series data is data of stock prices over time. Other examples of time series data in other domains include healthcare trends, economic data trends, and climate patterns.
- the time series data includes labeled trend events 252, such as a first labeled trend event 252-1 and a second labeled trend event 252-2.
- the time series data 250 includes 1000, 5000, 10,000, 50,000, or 100,000 labeled trend events.
- Figure 2B shows a block diagram of a labeled trend event 252-1 in accordance with some implementations.
- a labeled trend event 252-1 corresponds to a portion of a line chart and is identified by a chart ID 262-1, a start point 264-1, an end point 266-1, and a set of one or more labels 268-1.
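- The record described for Figure 2B can be pictured as a small immutable data structure. The following is a minimal sketch only: the class name, field names, and the use of calendar dates for the start and end points are assumptions made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class LabeledTrendEvent:
    """One labeled portion of a line chart (all names here are hypothetical)."""
    chart_id: str            # identifies the parent chart, e.g. a stock ticker
    start: date              # start point of the event along the x-axis
    end: date                # end point of the event along the x-axis
    labels: Tuple[str, ...]  # one or more trend labels for this portion

    def __post_init__(self) -> None:
        # Reject records that cannot describe a valid chart portion.
        if self.end < self.start:
            raise ValueError("end point precedes start point")
        if not self.labels:
            raise ValueError("an event needs at least one label")


# Illustrative record only; the values are not drawn from the labeled dataset.
event = LabeledTrendEvent("ALK", date(2014, 7, 8), date(2014, 7, 9), ("tank",))
```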
- a user selects one or more datasets / data sources 140 (which may be stored on the computing device 200 or stored remotely) and input queries are directed to the selected dataset / data sources.
- a dataset or data source 140 includes a set of synonyms for data values, data field names, and/or trend analysis labels;
- APIs 256 for receiving API calls from one or more applications (e.g., a web browser 226, an application 230, a search index 130, and/or a language model application 258), translating the API calls into appropriate actions, and performing one or more actions; and
- a language model application 258 which executes one or more large language models (LLMs).
- Each of the above identified executable modules, applications, or sets of procedures may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
- the memory 206 stores a subset of the modules and data structures identified above.
- the memory 206 may store additional modules or data structures not described above.
- a subset of the programs, modules, and/or data stored in the memory 206 is stored on and/or executed by a server system 300.
- Although Figure 2 shows a computing device 200, Figure 2 is intended more as a functional description of the various features that may be present rather than as a structural schematic of the implementations described herein.
- items shown separately could be combined and some items could be separated.
- some of the programs, functions, procedures, or data shown above with respect to the computing device 200 may be stored or executed on a server system 300.
- Figure 3 is a block diagram of a server system 300, in accordance with some implementations.
- the server system 300 typically includes one or more processing units/cores (CPUs) 302, one or more network interfaces 304, memory 314, and one or more communication buses 312 for interconnecting these components.
- the server system 300 includes a user interface 306, which includes a display 308 and one or more input devices 310, such as a keyboard and a mouse.
- the communication buses 312 include circuitry (sometimes called a chipset) that interconnects and controls communications between system components.
- the memory 314 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices, and may include non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices.
- the memory 314 includes one or more storage devices remotely located from the CPUs 302.
- the memory 314, or alternatively the non-volatile memory devices within the memory 314, comprises a non-transitory computer readable storage medium.
- the memory 314 or the computer readable storage medium of the memory 314 stores the following programs, modules, and data structures, or a subset thereof:
- an operating system 316 which includes procedures for handling various basic system services and for performing hardware dependent tasks;
- a network communications module 318 which is used for connecting the server 300 to other computers via the one or more communication network interfaces 304 (wired or wireless) and one or more communication networks, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
- a web server 320 (such as an HTTP server), which receives web requests from users and responds by providing responsive web pages or other resources;
- a web application 330 (e.g., the SlopeSeeker web application);
- a web application 330 has the same functionality as a desktop application 230, but provides the flexibility of access from any device at any location with network connectivity, and does not require installation and maintenance.
- the web application 330 includes various software modules to perform certain tasks, such as:
  o a user interface module 110, which provides the user interface for all aspects of the web application 330;
  o an interface manager module 332, which receives natural language queries (e.g., via user interface module 110), passes the queries to a parser module 336 (and/or a natural language processor module 334) for processing, receives relevant documents from a search index 130, and generates outputs that include charts, accompanying annotations, and/or text snippets;
  o a natural language processor module 334;
  o a parser module 336 (e.g., a semantic parser), as discussed below with reference to Section V.C.;
  o a visualization generation module 338, which generates and displays data visualizations (e.g., line charts) and accompanying annotations, and text snippets; and
  o an optional ranking module 340, which has the same functionality as the optional ranking module 240.
- the server system 300 includes a database 350.
- the database 350 includes a search index 130, which is described in Figure 2 and Section V.D below.
- the search index 130 includes:
- configuration settings 242 (e.g., configuration specifications), which define the requirements for analysis and indexing;
- an analysis module 244 which processes query string tokens and retrieves the most relevant labeled events based on the degree of overlap between the set of query string tokens and the document string tokens;
- the ranking module 246 ranks the labeled trend event results by computing a respective label score and a respective visual saliency score for each labeled trend event that was returned from the search index 130;
- the database 350 includes zero or more datasets or data sources 140, which are used by the web application 330, the search index 130, and/or the language model web application 358.
- the datasets / data sources 140 include time series data 250.
- the time series data includes labeled trend events 252, such as a first labeled trend event 252-1 and a second labeled trend event 252-2, as described in Figures 2A and 2B.
- a data source 140 includes synonyms 254 for data values, data field names, and/or trend labels.
- the memory stores APIs 356 for receiving API calls from one or more applications (e.g., a web server 320, a web application 330, a search index 130, and/or a language model web application 358), translating the API calls into appropriate actions, and performing one or more actions.
- the memory 314 stores a language model web application 358 that executes one or more LLMs.
- Each of the above identified executable modules, applications, or sets of procedures may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
- the memory 314 stores a subset of the modules and data structures identified above.
- the memory 314 may store additional modules or data structures not described above.
- Although Figure 3 shows a server system 300, Figure 3 is intended more as a functional description of the various features that may be present rather than as a structural schematic of the implementations described herein.
- items shown separately could be combined and some items could be separated.
- some of the programs, functions, procedures, or data shown above with respect to a server system 300 may be stored or executed on a computing device 200.
- the functionality and/or data may be allocated between a computing device 200 and one or more servers 300.
- Figure 3 need not represent a single physical device.
- the server functionality is allocated across multiple physical devices in a server system.
- references to a “server” include various groups, collections, or arrays of servers that provide the described functionality, and the physical servers need not be physically colocated (e.g., the individual physical devices could be spread throughout the United States or throughout the world).
- Figure 4 illustrates an annotation-collection tool interface 400 for performing the crowdsourced study, in accordance with some implementations.
- the tool interface 400 was implemented as a TypeScript frontend and a Django backend attached to a PostgreSQL database, and includes a left portion 402 and a right portion 404.
- the left portion 402 of the interface 400 comprises 42 word labels consisting of: (i) words related to the basic shape descriptors, ‘up,’ ‘down,’ and ‘flat,’ (ii) adjectives that describe such shapes (e.g., ‘slow,’ ‘rapid’), and (iii) words that describe the emergent shapes created by such regions (e.g., ‘plateau’ or ‘valley’).
- the inventors leveraged the hierarchy of hypernyms and hyponyms from WordNet, whose depth typically ranges up or down to two hierarchical levels (e.g., “up” -> [‘increasing’, ‘ascending’]), as well as word2vec to identify related concepts such as “sharp” and “increasing.”
- the list contained 8 nouns, 13 adjectives, and 21 verbs. While this list is not exhaustive, the inventors considered the set of words as a starting point for collecting nuanced language that describes common features found in line charts. The words were displayed in a randomized order in the interface for each participant to avoid positional bias.
- the right portion 404 of the interface 400 displays 16 line charts shown in random order to each participant to mitigate any positional bias. The same charts were shown to all participants. The charts were generated in Chart.js, showing years on the x-axis, ranging from 1960 to 2030. The title and its corresponding y-axis range were randomly assigned from one of the following topics: Average Income ($), Unemployment, Yards per Game, New Hire Referrals, Yearly Tourism, Rate of Inflation (%), Average House Price ($), Krakozhian Ducats per $US, Average Nightly Viewers, Economic Growth Rate (%), Gold Price ($/gram), Oil Price ($/barrel), Consumer Debt, Number of Wineries, Mortgage Rate (%), and Net Capital Flow ($).
- Each chart is a line graph constructed by connecting seven sequential line segments end to end. Each segment is randomly assigned one of nine different slopes: Up, Down, Flat with slopes [1, -1, 0]; SteepUp, SteepDown, SteepFlat with slopes [3, -3, 0]; and GentleUp, GentleDown, GentleFlat with slopes [0.5, -0.5, 0].
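- The chart construction just described can be sketched as follows. The nine slope names and values come from the text; the unit-width segments, the origin starting point, and the function name `make_chart` are assumptions introduced for illustration.

```python
import random

# Slope categories and values as listed in the study description.
SLOPES = {
    "Up": 1.0, "Down": -1.0, "Flat": 0.0,
    "SteepUp": 3.0, "SteepDown": -3.0, "SteepFlat": 0.0,
    "GentleUp": 0.5, "GentleDown": -0.5, "GentleFlat": 0.0,
}


def make_chart(n_segments=7, seed=None):
    """Return (segment names, polyline points) for one synthetic chart.

    Assumption: every segment spans one unit along the x-axis and the
    line starts at the origin; the source does not specify either.
    """
    rng = random.Random(seed)
    names = [rng.choice(list(SLOPES)) for _ in range(n_segments)]
    points = [(0.0, 0.0)]
    for name in names:
        x, y = points[-1]
        # Connect segments end to end: rise equals slope over a unit run.
        points.append((x + 1.0, y + SLOPES[name]))
    return names, points


names, points = make_chart(seed=1)
```

- Seven segments always yield eight points, and passing a fixed `seed` makes the same chart reproducible across participants.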
- the words are snapped to the nearest chart position.
- Words may be moved or deleted once they are attached to a chart.
- Individual words may be used on multiple charts and multiple times on a single chart. Multiple words can be dragged to the same feature in a chart.
- the chart identifier, the annotation, the position along the line graph where the annotation occurred, the date the annotation occurred, and a unique anonymous participant identifier were recorded.
- the data collected was analyzed by determining “term co-occurrence” and “annotation clustering,” with the goal of discovering quantifiable relationships among the different annotation terms.
- Annotation co-occurrence enables one to understand how often different annotation terms are used to label the same visual feature.
- Annotations are clustered using hierarchical clustering and Ward’s linkage calculated with Euclidean distance. These approaches tend to identify dense clusters while making a minimum number of assumptions about cluster size, shape, and count. Position matrix entries are assigned by segment co-occurrence. For example, if ‘quick’ and ‘fast’ co-occurred 10 times, then each would have the position [10] on the other’s axis. The matrix is then scaled so all values are in [0,1], and values of 1.0 are placed along the diagonal.
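- The position matrix described above can be sketched in plain Python. This is an illustrative sketch, not the study's code: the input format (one set of terms per labeled visual feature) and the choice to scale by the largest pair count are assumptions, since the text only states that values end up in [0,1].

```python
from collections import Counter
from itertools import combinations


def cooccurrence_matrix(groups):
    """Build a scaled term co-occurrence matrix.

    groups: iterable of sets, each holding the terms applied to the same
    visual feature (hypothetical input format).
    Returns (sorted term list, square matrix as nested lists).
    """
    groups = [set(g) for g in groups]
    terms = sorted({t for g in groups for t in g})
    counts = Counter()
    for g in groups:
        for a, b in combinations(sorted(g), 2):
            counts[(a, b)] += 1
    # Assumption: scale by the largest pair count so entries fall in [0, 1].
    peak = max(counts.values(), default=1)
    index = {t: i for i, t in enumerate(terms)}
    matrix = [[0.0] * len(terms) for _ in terms]
    for (a, b), c in counts.items():
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = c / peak
    for i in range(len(terms)):
        matrix[i][i] = 1.0  # values of 1.0 along the diagonal
    return terms, matrix


terms, matrix = cooccurrence_matrix(
    [{"quick", "fast"}] * 10 + [{"quick", "slow"}] * 5
)
```

- Each row of the resulting matrix can then serve as a term's position vector for agglomerative clustering with Ward's linkage, for example via SciPy's `scipy.cluster.hierarchy.linkage(rows, method="ward")`.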
- Figure 5 illustrates various inter-word relationships and an average slope of annotations determined from the crowdsourced study, in accordance with some implementations.
- Term co-occurrence analysis quantifies which words are typically present together. Agglomerative hierarchical clustering of term co-occurrence results in distinct groups, suggesting a high degree of semantic agreement among participants.
- Some implementations disclose two techniques - shape identification and slope identification - for automatic labeling of visual chart features according to the dataset of terms and visual features obtained from the crowdsourced study (see Section above).
- shape identification is useful for discovering concrete shapes such as “peak” and “valley.”
- slope identification is useful for describing how univariate data changes along the y-axis.
- a univariate data set is referred to as a “signal” and the small annotated source signal whose shape the disclosed algorithm is looking for in a larger unlabeled signal is referred to as the “kernel.”
- Shape identification tries to find a participant-annotated visual feature in a larger unlabeled signal.
- Figure 6 illustrates automatic labeling of visual features in a line chart, in accordance with some implementations.
- Figure 6 part A shows an example of finding a “bump” and an “upturn,” in accordance with some implementations.
- This shape identification approach is particularly applicable to finding visual features that are constructed from multiple segments (e.g., a “peak” consists of a rising segment followed by a falling segment).
- a shape discovery algorithm for identifying a kernel signal’s shape within a larger unlabeled signal includes the steps of:
- kernels near the edge may consist of fewer than five segments.
- For each such kernel signal, create shallow and deep variants of it, where the normalized variant heights range over [0.1, 1.0] in steps of 0.1.
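- The variant-generation step can be sketched as below. The min-max normalization to unit height before rescaling is an assumption, as is the function name `height_variants`; the text specifies only that normalized heights run from 0.1 to 1.0 in increments of 0.1.

```python
def height_variants(kernel):
    """Return ten rescaled copies of a kernel, keyed by normalized height.

    Assumption: the kernel is first min-max normalized to unit height,
    then multiplied by each target height 0.1, 0.2, ..., 1.0.
    """
    lo, hi = min(kernel), max(kernel)
    span = hi - lo
    if span == 0:
        unit = [0.0] * len(kernel)  # a flat kernel has no height to scale
    else:
        unit = [(v - lo) / span for v in kernel]
    return {
        round(step / 10, 1): [u * step / 10 for u in unit]
        for step in range(1, 11)
    }


# A simple "peak"-like kernel: rising samples followed by falling samples.
variants = height_variants([0.0, 2.0, 4.0, 2.0, 0.0])
```

- The shallowest variant (key 0.1) and the deepest (key 1.0) share the same outline and differ only in vertical extent, which lets a matcher find the same shape at different amplitudes.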
- the quantified slope semantics shown in Figure 5 provide an additional tool for visual feature identification. Specifically, the quantified slope semantics help identify specific relationships among line slope, hedge words, and the hedge word’s semantic modifiers.
- selecting an <adjective> <verb> annotation for a given chart region uses the following protocol:
- Some implementations combine the discovered semantic labels discussed above with additional information from the data set to form LLM queries. For example, some instances use a specific stock symbol “ALK” and the dates of the discovered visual feature to ask the GPT-3.5 LLM the templated question, “What happened between <July 8, 2014> and <July 9, 2014> that caused the stock symbol <ALK> to <tank>?”
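- Filling the templated question is a simple string substitution, sketched below. The wording of the template follows the example in the text; the function name `build_prompt` and the date formatting helper are assumptions, and no LLM call is shown.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

TEMPLATE = (
    "What happened between {start} and {end} that caused "
    "the stock symbol {symbol} to {label}?"
)


def format_day(d):
    """Render a date as, e.g., 'July 8, 2014' without relying on locale."""
    return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"


def build_prompt(symbol, start, end, label):
    """Combine a discovered label with chart metadata into one question."""
    return TEMPLATE.format(
        start=format_day(start), end=format_day(end),
        symbol=symbol, label=label,
    )


prompt = build_prompt("ALK", date(2014, 7, 8), date(2014, 7, 9), "tank")
```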
- annotations and summaries describing visual features in charts can be used as metadata in search interfaces to find pre-authored charts based on search queries such as, “find me the sales chart that has a spike in 2009, followed by a gradual decline,” or in a voice assistant to ask for real-time notifications about data - “Hey Voice Assistant, tell me if this stock tanks.”
- the work can also provide language prompts to LLMs to support sketching interfaces used for generating data stories.
- Some implementations disclose creating a dataset of quantified semantic labels for trend descriptor words, which are in turn used in conjunction with a search index to return precise search results to an analyst.
- the crowdsourced dataset was subsequently operationalized by applying new trend descriptor labels to unlabeled time series data - in this case, stock prices over time.
- Some implementations apply a novel visual saliency scoring algorithm to the labeled stock data to help boost perceptually prominent trend descriptor results during search.
- the final stock data with labeled trends was subsequently used to populate the search index for the SlopeSeeker tool.
- The inventors designed and conducted three crowdsourced experiments (referred to hereinafter as Experiment 1, Experiment 2, and Experiment 3) to collect a dataset of quantified semantics for trend descriptor words.
- The details of the experiments are described in U.S. Provisional Patent Application No. 63/543,070, filed October 7, 2023, titled “Search Tool for Exploring Quantifiable Trends in Line Charts,” which is incorporated by reference herein in its entirety.
- the goal of Experiment 1 was to estimate a slope distribution for every label, where each label is a single trend descriptor word.
- the goal of Experiment 2 was to assess the impact of modifier adverbs on the quantified semantics of two-word trend descriptor phrases (e.g., “falling slowly”).
- the goal of Experiment 3 was to gather labels for different shapes found in time series data. For Experiment 3, a shape is defined as a pair of connected line segments with varying degrees of (1) inclination angle between the two lines and (2) overall 360° rotation or orientation.
- Figures 7A, 7B, and 7C show the results of the crowdsourced experiments.
- Figure 7A shows the 41 labels that were investigated in Experiment 1 and, for each label, a one-dimensional Kernel Density Estimation (KDE) indicating a probability density for the respective label over the range of -90° to 90°. Peak probability density was used to sort the labels from the most negative angle (steepest down) to the most positive angle (steepest up) from the top left to the bottom right, respectively.
- KDE is a common tool for estimating the probability density function of a random variable without making assumptions about the nature of the distribution (e.g., that it is normal). As such, KDE is a useful tool for estimating the slope distributions for each label.
- Gaussian kernels are a common choice for smoothing KDE datapoints because the shape is symmetric, has well-understood mathematical properties, and, notably, allows the ‘bandwidth’ KDE parameter to be interpreted as the Gaussian standard deviation.
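- A one-dimensional Gaussian KDE of the kind described can be written in a few lines. The sketch below uses the standard estimator with the bandwidth acting as the Gaussian standard deviation; the function names and the sample angles are hypothetical and do not come from the experiments.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from 1-D samples.

    Each sample contributes a Gaussian bump whose standard deviation
    equals the bandwidth; the bumps are averaged.
    """
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def peak_angle(density, lo=-90, hi=90):
    """Angle (whole degrees) with the highest density over [lo, hi]."""
    return max(range(lo, hi + 1), key=density)


# Hypothetical slope angles (degrees) collected for one label.
density = gaussian_kde([40.0, 45.0, 50.0], bandwidth=5.0)
```

- Sorting labels by `peak_angle` corresponds to the peak-probability ordering used when arranging the per-label distributions from steepest down to steepest up.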
- Figure 7B shows the KDE distributions of Experiment 2’s compound labels over the range of -90° to 90°. Peak probability density was used to sort the labels from the most negative angle (steepest down) to the most positive angle (steepest up) from the top left to the bottom right, respectively.
- Figure 7C shows respective two-dimensional KDE plots for shape labels that were investigated in Experiment 3.
- Hyponym: A word is a hyponym of another word if its interquartile range (IQR) is subsumed by the other word’s IQR.
- Hypernym: A word is a hypernym of another word if its IQR subsumes the other word’s IQR.
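- The two definitions above reduce to an interval-containment test. The sketch below assumes each word is represented by a list of slope angles gathered for it and uses the default quartile method of Python's `statistics.quantiles`; both choices are assumptions, as are the sample values.

```python
import statistics


def iqr(samples):
    """Return (Q1, Q3) for a list of numeric samples."""
    q1, _median, q3 = statistics.quantiles(samples, n=4)
    return q1, q3


def is_hyponym(word_samples, other_samples):
    """True if the word's IQR lies inside the other word's IQR."""
    lo, hi = iqr(word_samples)
    other_lo, other_hi = iqr(other_samples)
    return other_lo <= lo and hi <= other_hi


def is_hypernym(word_samples, other_samples):
    """True if the word's IQR contains the other word's IQR."""
    return is_hyponym(other_samples, word_samples)


# Hypothetical angle samples for a narrow term and a broad term.
narrow = [40, 42, 44, 46, 48]
broad = [10, 20, 30, 40, 50, 60, 70, 80]
```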
- Temporal stock price data is used as an exemplary dataset, and relevant portions of temporal stock price data are annotated with labels such as “sharply climbing” and “cliff.”
- An exemplary algorithm for label assignment is as follows: decompose the input signal into linear segments, calculate angles and rotations over those segments, use those angles and rotations to index into the KDEs from Experiments 1, 2, and 3, and discover appropriate labels.
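- The angle-then-lookup portion of this algorithm can be sketched as follows. The toy density functions stand in for the KDEs from the experiments; their centers and widths are invented for illustration, and the linear-segment decomposition step is assumed to have already produced the segment endpoints.

```python
import math


def segment_angle(x0, y0, x1, y1):
    """Inclination of a segment in degrees, on axes already normalized."""
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def best_label(angle, densities):
    """Pick the label whose density is highest at the given angle."""
    label = max(densities, key=lambda name: densities[name](angle))
    return label, densities[label](angle)


def bump(center, width):
    """Toy stand-in for a per-label KDE (hypothetical shape)."""
    return lambda a: math.exp(-0.5 * ((a - center) / width) ** 2)


# Hypothetical per-label densities; real ones would come from the KDEs.
TOY_DENSITIES = {
    "tanking": bump(-70.0, 10.0),
    "flat": bump(0.0, 5.0),
    "soaring": bump(70.0, 10.0),
}

angle = segment_angle(0.0, 0.0, 1.0, 2.0)
label, score = best_label(angle, TOY_DENSITIES)
```

- Because every angle yields some highest-scoring label, the returned `score` matters as much as the label itself; a low score signals a weak match.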
- the x and y axes of the temporal stock data need to be normalized to roughly the same scale. Otherwise, the angle and rotation calculations would be meaningless. There is no “correct” mathematical relationship or shared scale to be established between the stock price in dollars (or any other non-temporal measure) (e.g., y-axis) and time (e.g., x-axis). To resolve this, two observations were made:
- the quality of “steepness” is, to a large degree, perceptual and anchored to both the time range that is analyzed and the size and shape of the chart. For example, a multi-month sea level drop viewed over one year might be perceived as “gradual,” but that same drop viewed over 1000 years on a chart of the same size would be perceived as “sudden”.
- any point - even a point that is far away from all labels and should not be assigned a label - will always return a non-zero probability density score, resulting in some (possibly inappropriate) label.
- labels with the lowest scores were filtered out by taking the set of all segment labels (one label per segment), sorting them by their probability density, and filtering out the bottom 25%. The remaining 75% of labels were used for the SlopeSeeker database (e.g., in the database 350).
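- The filtering step can be sketched as a sort followed by a cut. The (label, density) pair format and the rounding of the cutoff with integer division are assumptions.

```python
def drop_lowest_quartile(scored_labels):
    """Discard the 25% of segment labels with the lowest density.

    scored_labels: list of (label, probability_density) pairs, one per
    segment (hypothetical format). Returns the surviving pairs sorted
    from lowest to highest density.
    """
    ranked = sorted(scored_labels, key=lambda item: item[1])
    cut = len(ranked) // 4  # assumption: round the cutoff down
    return ranked[cut:]


# Hypothetical scores for eight segments.
scored = [
    ("soaring", 0.08), ("flat", 0.01), ("tanking", 0.05), ("rising", 0.03),
    ("dipping", 0.02), ("climbing", 0.07), ("falling", 0.06), ("steady", 0.04),
]
kept = drop_lowest_quartile(scored)
```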
- Figure 9 illustrates examples of segment labeling with single labels, in accordance with some implementations.
- the three sub-charts correspond to the three levels of linearization. For clarity, only the top 25% of labels are shown in this example.
- Figure 10 illustrates examples of segment labeling with compound labels, in accordance with some implementations.
- the three sub-charts correspond to the three levels of linearization. For clarity, only the top 25% of labels are shown in this example.
- Figure 11 illustrates examples of shape labeling, in accordance with some implementations.
- the three sub-charts correspond to the three levels of linearization. For clarity, only the top 25% of labels are shown in this example.
- Superlatives: In some instances, a user may use superlative descriptors (e.g., “maximum”, “minimum”, “highest point”, etc.) to search for particular features in line charts. With such descriptors, the user evidently wants to retrieve the single highest or lowest value throughout the time frame of the data. In some implementations, to support the querying of superlative features in the dataset, applicable superlative labels are generated. Using the time series data of stock prices as an example, for each stock, the highest and lowest values are identified over the length of the time series data. Then, the event consisting of 15 days both before and after the maximum or minimum is incorporated into the event label to facilitate ease of viewing, as illustrated in Figure 12.
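- Generating a superlative event can be sketched as locating the extreme value and padding it by 15 days on each side. Treating the padding as calendar days, clamping the window to the available data, and the dictionary layout of the result are assumptions.

```python
from datetime import date, timedelta


def superlative_event(series, kind="maximum", pad_days=15):
    """Build a labeled window around the highest or lowest value.

    series: chronologically ordered list of (date, value) pairs.
    kind: "maximum" or "minimum".
    """
    if kind not in ("maximum", "minimum"):
        raise ValueError("kind must be 'maximum' or 'minimum'")
    pick = max if kind == "maximum" else min
    peak_day, peak_value = pick(series, key=lambda point: point[1])
    first_day, last_day = series[0][0], series[-1][0]
    # Assumption: clamp the padded window to the extent of the data.
    start = max(first_day, peak_day - timedelta(days=pad_days))
    end = min(last_day, peak_day + timedelta(days=pad_days))
    return {"label": kind, "start": start, "end": end, "value": peak_value}


# Synthetic series that peaks on day 40 of 100 (values are made up).
origin = date(2014, 1, 1)
series = [(origin + timedelta(days=i), 100.0 - abs(i - 40)) for i in range(100)]
top = superlative_event(series, "maximum")
```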
- Figure 12 shows a line chart 1200 that is output by the SlopeSeeker system in response to a user query (e.g., “maximum stock price”) that includes a superlative descriptor “maximum,” in accordance with some implementations. Segment 1202 of the chart corresponds to the event where the stock price was the highest.
- Figure 13 shows a line chart 1300 that is output by the SlopeSeeker system in response to a user query that includes the terms “gradually increasing.”
- segment 1302 and segment 1304 of the chart correspond to events that match the user query (e.g., based on slope).
- the event that occurred during 2016 (i.e., corresponding to segment 1304) intuitively appears more prominent and impactful than the event in 2015 (segment 1302).
- each trend event is a vector that covers some of the encompassing chart’s visual space in both the x direction (i.e., the temporal duration of the trend) and in the y direction (i.e., the data value delta over the course of the trend).
- an exemplary algorithm for computing visual saliency is as follows: for each trend result (single-segment slopes): do
- the visual saliency is determined using Equation 1 below:
- the final labeled stock data that was loaded into the prototype SlopeSeeker tool contains 8,353 data points (labeled events) for 100 different stocks.
- the time period for each stock is a three-year period from the start of 2014 to the end of 2016, and each labeled event covers some subset of this time span.
- the SlopeSeeker system is developed as a search tool to operationalize the dataset of quantified semantic trend labels (e.g., as discussed in Section IV).
- Figure 14A illustrates the SlopeSeeker system architecture, in accordance with some implementations.
- the input 1404 is passed to an interface manager 232, which in turn passes the raw query 1406 to a natural language parser 236.
- the query is then processed (1408) and the processed search terms are used to write queries to a search index 130 (e.g., an Elasticsearch index or other search indexing frameworks such as Solr, Sphinx, or OpenSearch).
- the search index 130 returns relevant “documents” 1410 (and their respective scores) to the interface manager 232, which generates the system output 1412 of charts and accompanying annotations.
- a “document” as used herein refers to a labeled trend event.
- the SlopeSeeker system is implemented as a web-based application using Python and a Flask backend connected to a React.js frontend.
- Elasticsearch, a robust distributed search platform built on the open-source Apache Lucene, is employed. The platform offers scalability of data, real-time indexing for fast querying, and is adept at handling text-heavy and diverse datasets.
- a RESTful API is employed for easy integration with SlopeSeeker.
- Figure 14B illustrates a user interface 110 for the SlopeSeeker system, in accordance with some implementations.
- the user interface 110 is designed to provide an experience similar to that of using a search engine.
- a search bar 102 (e.g., a natural language input box)
- a notification box 1422 (labeled as “2” in Figure 14B) informs the user which terms are not being matched exactly.
- the user interface 110 includes a side bar 1424 (labeled as “3” in Figure 14B) that allows the user to optionally filter the results to include only specific labels that are of interest.
- the text box filter in the side bar 1424 is nested hierarchically by individual semantic concepts. For instance, “soaring” is a parent of both “slow soaring” and “fast soaring.” Results appear as tiles 1430 (e.g., tile 1430-1 and tile 1430-2) in a region 1426 below the search bar 102 (labeled as “4” in Figure 14B).
- Each tile 1430 corresponds to one stock and shows a line chart 1432 of the stock price over time (e.g., line chart 1432-1 for the stock FSLR and line chart 1432-2 for the stock ILMN), the stock ticker 1434, the number of matches 1436 for that input query for that stock, and a text description 1438 (e.g., text snippet) describing up to the three highest-scoring matches for that stock.
- the time periods corresponding to those highest scoring matches are also emphasized in a different color on the line chart.
- the emphasized chart segments and corresponding text snippets are interactively and bi-directionally linked. Hovering over a chart segment will fade out any other emphasized segments and will highlight the corresponding text in gray; hovering over a text snippet works similarly. If a stock has more than three matches, the user can expand the tile (e.g., via selection of user-selectable affordance 1440) to show a list of the rest of the matches, which can also be hovered over to display them on the line chart.
- the SlopeSeeker system includes a semantic parser (e.g., parser 236 or parser module 336) for parsing trends that contain semantic labels, attributes, and temporal filter attributes.
- the semantic parser converts natural language inputs into structured representations, allowing for explicit reasoning, reduced ambiguity, and consistent interpretation.
- the semantic parser also provides the convenience of better traceability and is performant for structured tasks.
- the semantic parser is combined with one or more large language models (LLMs) (e.g., language model application 258 or language model web application 358).
- the semantic parser is used for structured tasks and the LLMs are used for open-ended tasks in the context of a more comprehensive analytics tool.
- the semantic parser is implemented using spaCy, an open-source NLP library for Python, which employs compositional semantics to identify tokens and phrases based on their semantics to create a valid parse tree from the input search query.
- the semantic parser takes as input the individual tokens in the query and assigns semantic roles to these tokens.
- event_type (e.g., single event or multi-sequence event)
- trend_terms (e.g., “tanking” and “plateau”)
- attr (e.g., data attributes, data fields, or data field names, such as stock ticker symbols and company names)
- date_range (e.g., absolute date ranges and relative date ranges)
- the tokens and their corresponding semantic roles are translated into a machine-interpretable form that can be processed to retrieve relevant search results in SlopeSeeker.
- the parser output is as follows:
- each labeled trend event (considered a “document” for the search scenario disclosed herein) is added to the search index, wherein indexed documents are first retrieved and then ranked according to a match score.
- the search index 130 includes built-in scoring logic that is combined with a visual saliency score to produce a scoring mechanism tailored to the use case described herein.
- matching documents are grouped by their parent chart for presentation to a user (e.g., as tiles 1430 in the user interface 110).
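- Grouping retrieved documents into per-chart tiles can be sketched as below. The hit dictionary keys, the ordering of tiles by their best match, and the function name are assumptions; the limit of three highlighted matches per tile follows the user interface description.

```python
from collections import defaultdict


def group_by_chart(hits, max_snippets=3):
    """Group scored hits by parent chart and keep the top matches per tile.

    hits: list of dicts with at least "chart_id" and "score" keys
    (hypothetical document shape).
    """
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["chart_id"]].append(hit)
    tiles = []
    for chart_id, chart_hits in grouped.items():
        ranked = sorted(chart_hits, key=lambda h: h["score"], reverse=True)
        tiles.append({
            "chart_id": chart_id,
            "match_count": len(ranked),
            "top": ranked[:max_snippets],
        })
    # Assumption: show charts with the strongest single match first.
    tiles.sort(key=lambda tile: tile["top"][0]["score"], reverse=True)
    return tiles


# Hypothetical hits with made-up scores.
hits = [
    {"chart_id": "FSLR", "score": 2.0},
    {"chart_id": "ILMN", "score": 5.0},
    {"chart_id": "FSLR", "score": 3.0},
    {"chart_id": "FSLR", "score": 1.0},
    {"chart_id": "FSLR", "score": 4.0},
]
tiles = group_by_chart(hits)
```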
- SlopeSeeker supports different types of queries beyond single trend events, including event sequences and more long-term, global descriptors.
- the indexing phase creates indices for each of the documents in a dataset along with their metadata.
- Each document (i.e., a labeled event, corresponding to a portion of a line chart identified by a chart ID, start point, end point, and set of labels) is represented as a document vector di.
- n-gram string tokens are stored from these document vectors to support both partial (e.g., inexact) matches and exact matches at search time:
- the search index 130 performs synonym and edge n-gram processing according to the specification in the search index settings.
- the search index 130 is configured to retrieve labeled trend events based on exact and partial (e.g., inexact) matches between the query tokens and the labeled data.
- a retrieved labeled trend event can be an exact match to at least one token (e.g., a user types “tanking” and a matching document would contain that word) or an inexact match to at least one token.
- An inexact match occurs when a search result is returned as a result of support for synonyms or edge n-gram matches.
- An example of a synonym match is when a user types in the word “plummeting” and no labeled event contains that word “plummeting,” but at least one labeled event contains the word “tanking,” and the specification for the search index settings has specified that "plummeting" and "tanking" are synonyms.
- An edge n-gram match occurs when the user only partially types in a search term and the search index can guess what the user means based on these first few letters. For instance, if the user types "dro", the search index would return documents that contain "dropping".
- the original vectors D and encoded tokens S are stored in the semantic search engine index by specifying the mapping of the content, which defines the type and format of the fields in the index.
- the “content” refers to the raw event label for each document. For example, a label for an event could be “tanking.”
- the “mapping” specifies how the content will be processed and interpreted by the search index 130 for storage as fields in the index.
- “Fields” are different ways of storing copies of documents' data in the search index 130 so that documents can be retrieved in various ways.
- the “type” (e.g., text type) and “format” of each field determines how it is stored and how documents can later be matched to search queries.
- a synonym field can take the label of “tanking” and map it to the synonym “plummeting” based on synonym specifications (e.g., specified in the search index settings), allowing this same document to be retrieved by user searches for either “tanking” or “plummeting.”
- an edge n-gram field can take the same label of “tanking” and map it to shortened sub-strings of that label, such as 'tan', so that searching these sub-strings will also retrieve that document.
- each semantic trend label and its associated stock data are stored as tokens in the search index in multiple processed formats (i.e., in different fields), enabling fast and flexible retrieval at search time.
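As a rough illustration of storing one label in several processed formats, the sketch below derives a content field, a synonym field, and an edge n-gram field from a raw label. The field names, the synonym specification, and the minimum n-gram length of three are assumptions, not the configuration shown in Figures 15 and 16.

```python
from typing import Dict, List

# Hypothetical synonym specification standing in for the search index settings.
SYNONYM_SPEC: Dict[str, List[str]] = {"tanking": ["plummeting"]}


def build_fields(label: str, min_gram: int = 3) -> Dict[str, List[str]]:
    """Derive the stored copies of a raw event label, one list per field."""
    tokens = label.lower().split()
    # Synonym field: original tokens plus any synonyms from the specification.
    synonyms = sorted({s for t in tokens for s in SYNONYM_SPEC.get(t, [])})
    # Edge n-gram field: prefixes of each token, from min_gram up to the full token.
    grams = [t[:n] for t in tokens for n in range(min_gram, len(t) + 1)]
    return {"content": tokens, "synonym": tokens + synonyms, "edge_ngram": grams}
```

With this sketch, the label "tanking" can be retrieved by a search for "tanking", "plummeting", or the prefix "tan".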
- This indexing enables full-text search on the labels in the index, supporting exact-value search, fuzzy matching to handle typos and spelling variations, and n-grams for multi-word label matching.
- a scoring algorithm, tokenizers, and filters are specified as part of the search index settings. These settings specify how the matched documents are scored with respect to the input query, as well as the handling of tokens, including the conversion of tokens to lowercase and the addition of synonyms from a thesaurus.
- Figure 15 shows a code snippet for a search index configuration, in accordance with some implementations.
- the scripted similarity in the search index configuration defines a custom scoring mechanism for ranking search results based on term frequency (tf), inverse document frequency (idf), and normalization.
- the search index configuration also incorporates synonym expansion and edge n-grams for more flexible and comprehensive search results.
- a synonym file "synonyms_final.txt" is used to expand or replace terms.
- The contents of the synonym file are illustrated in Figures 17A and 17B, in accordance with some implementations.
- the term “subsiding” has synonyms “lessening,” “lessen,” “relaxing,” “easing,” and “abating.”
- a natural language query that includes the term “subsiding” would cause the search index to return the same set of labeled trend events as another natural language query that includes the term “easing.”
- Figure 16 illustrates a code snippet for defining the properties of fields within an index mapping in the search index, in accordance with some implementations.
- the search phase can be conceptualized as having two steps - retrieval and ranking.
- a user input query q that is represented as a query vector q with query tokens $q_1, q_2, \ldots, q_j$.
- the “degree of overlap” refers to the magnitude of the set intersection (i.e., number of common tokens) between the set of query tokens and the set of document string tokens.
- the “most relevant” documents (i.e., labeled trend events) are a predefined number of labeled trend events (e.g., 1000, 800, 500, or 200) that have the greatest “degree of overlap” out of all documents.
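The retrieval step can be sketched as a set intersection followed by a top-k cut. The function names, the tie-breaking by document ID, and the default limit of 200 are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple


def degree_of_overlap(query_tokens: Set[str], doc_tokens: Set[str]) -> int:
    """Number of tokens shared by the query and a document."""
    return len(query_tokens & doc_tokens)


def retrieve(
    query_tokens: Set[str], docs: Dict[str, Set[str]], limit: int = 200
) -> List[Tuple[str, int]]:
    """Return up to `limit` (doc_id, overlap) pairs with the greatest overlap."""
    scored = [(doc_id, degree_of_overlap(query_tokens, toks)) for doc_id, toks in docs.items()]
    matching = [pair for pair in scored if pair[1] > 0]
    # Highest overlap first; ties broken by document ID for a stable order.
    matching.sort(key=lambda pair: (-pair[1], pair[0]))
    return matching[:limit]
```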
- the scoring function Tmax maximizes search relevance as follows:
- For search inputs that contain both a noun/verb descriptor (e.g., “decline”) and a modifying adjective (e.g., “fast”), SlopeSeeker subsequently filters out partially matching documents that contain only the adjective. For example, this would prevent a query of “fast decline” from returning documents labeled “fast increase” as partial matches. More formally, if s contains at least one token that matches a noun/verb descriptor in at least one document, then every matching document $d_i$ must contain that descriptor in its set of string tokens $S_i$. However, users may still enter search queries consisting only of an adjective and see documents where that adjective is paired with a variety of noun/verb descriptors.
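A minimal sketch of this descriptor filter follows. The descriptor vocabulary and the function name are hypothetical, and the sketch assumes that when a query names several descriptors a document must contain all of them.

```python
from typing import Dict, List, Set

# Hypothetical vocabulary of noun/verb descriptors.
DESCRIPTORS: Set[str] = {"decline", "increase"}


def filter_partial_matches(query_tokens: Set[str], docs: Dict[str, Set[str]]) -> List[str]:
    """Drop documents that match only the adjective when a descriptor is present."""
    matching = {d: toks for d, toks in docs.items() if query_tokens & toks}
    # Descriptors from the query that occur in at least one matching document.
    required = {
        q for q in query_tokens & DESCRIPTORS
        if any(q in toks for toks in matching.values())
    }
    if not required:
        return sorted(matching)  # adjective-only query: keep every match
    # Assumption: every required descriptor must appear in the document.
    return sorted(d for d, toks in matching.items() if required <= toks)
```

With documents labeled "fast decline", "fast increase", and "slow decline", the query "fast decline" keeps the two "decline" documents, while the adjective-only query "fast" keeps both "fast" documents.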
- SlopeSeeker ranks document results (labeled trend event results) based on two components.
- the first component is how precisely the search term matches the event labels of the document, which is computed by the search index 130 according to the index and search settings.
- a scoring scheme is utilized where this document’s score is the frequency with which the search terms occur in its label, divided by the length of its label. This means that events with longer labels (e.g., those with modifying adjectives like “slow” or “fast”) will be scored higher than events with shorter labels if and only if the additional tokens accounting for the added length match the search terms.
- the numerator has value 2 because there are two search terms “slow” and “climbing” that match the user search input.
- the value of the denominator is 12 because there are 12 letters in the label.
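The worked example above (numerator 2, denominator 12) can be reproduced with a simplified sketch of the label score. This is not the scripted similarity configured in the search index; it only assumes that the score is the count of matching label tokens divided by the number of letters in the label.

```python
from typing import Iterable


def label_score(search_terms: Iterable[str], label: str) -> float:
    """Matching label tokens divided by the label length in letters."""
    terms = {t.lower() for t in search_terms}
    label_tokens = label.lower().split()
    matches = sum(1 for tok in label_tokens if tok in terms)
    length = sum(len(tok) for tok in label_tokens)  # letters only, spaces excluded
    return matches / length if length else 0.0
```

For the search "slow climbing", the label "slow climbing" scores 2/12, which is higher than the 1/8 scored by the shorter label "climbing"; for the search "climbing" alone, the order reverses (1/12 versus 1/8).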
- the second scoring component is the visual saliency score of the labeled event (Section IV.D).
- the visual saliency score quantifies the perceptual prominence of a trend event. It is specifically designed for the search scenario to favor the most visually salient events, motivated by prior research showing that text annotations corresponding to visually salient features of line charts are most effective at driving reader takeaways.
- the final composite score used to rank events in the results is then the product of the search index (e.g., Elasticsearch index, or other search indexing frameworks such as Solr, Sphinx, or OpenSearch) component and the visual saliency component.
- the visual saliency component of scoring is most important when there are a large number of matching results for a user query.
- a user is interested in “stocks that increased.”
- search index scores e.g., Elasticsearch scores.
- these results are not likely to all be of equal interest to the user.
- a short three-day increase in stock price is probably less interesting, both visually and in terms of the analytical task at hand, compared to a three-month increase during which much more stock value was gained. Note that these could both have similar slopes and thus identical labels.
- the visual saliency scoring component thus serves as a tiebreaker to boost results with greater prominence and relevance over others that share identical labels.
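The tiebreaker role of visual saliency can be sketched as follows. The class and field names are hypothetical and the numeric scores are made-up illustrative values; only the product rule comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredEvent:
    chart_id: str
    label_score: float  # from the search index
    saliency: float     # visual saliency of the event

    @property
    def composite(self) -> float:
        """Composite score: product of the two components."""
        return self.label_score * self.saliency


def rank(events: List[ScoredEvent]) -> List[ScoredEvent]:
    """Order events from highest to lowest composite score."""
    return sorted(events, key=lambda e: e.composite, reverse=True)


# Two events with identical labels, and therefore identical label scores.
three_day = ScoredEvent("A", label_score=0.125, saliency=0.01)
three_month = ScoredEvent("B", label_score=0.125, saliency=0.4)
ranked = rank([three_day, three_month])
```

Because the label scores are equal, the more salient three-month event is ranked first.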
- the indexed data and result scoring are at the level of the labeled trend event (e.g., document), where each labeled trend event is a labeled slope segment.
- Any individual chart e.g., stock
- events are not presented individually but are placed into buckets at search time based on their chart identifier (e.g., stock key).
- Events within a bucket are sorted by their composite score. Buckets themselves are also scored; the final score for each bucket is the sum of the composite scores of its individual events, and buckets are presented in sorted order according to this final score.
- this scheme is designed to create an experience akin to standard “document search,” where more matches in a bucket bump that bucket higher in the results.
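The bucketing scheme can be sketched in a few lines. The input shape (pairs of chart identifier and composite score) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bucket_and_rank(events: List[Tuple[str, float]]) -> List[Tuple[str, float, List[float]]]:
    """Group composite scores by chart ID, then rank buckets by their summed score."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for chart_id, composite in events:
        buckets[chart_id].append(composite)
    ranked = [
        # Events within a bucket are sorted by composite score, highest first.
        (chart_id, sum(scores), sorted(scores, reverse=True))
        for chart_id, scores in buckets.items()
    ]
    # The final score of a bucket is the sum of its events' composite scores.
    ranked.sort(key=lambda bucket: bucket[1], reverse=True)
    return ranked
```

A chart with two matching events therefore tends to outrank a chart with a single event of similar score, which mirrors the document-search behavior described above.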
- a sequence query consists of a list of trend events in a specified order.
- Each of the trend events can be a single word or multi-word. Since this type of query is not straightforward to support natively in the search index 130 (e.g., Elasticsearch) based on the way the data is indexed, sequence query results are constructed as follows. First, each individual constituent event is run through the search index 130 as its own single-word or multi-word query but not yet bucketed. Then, sequences are constructed by taking these results and performing an SQL join based on chart identifier and start/end dates. In some implementations, a tunable parameter is included to allow for some temporal delay between adjacent events.
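The join step can be approximated in plain Python rather than SQL. The sketch below assumes that adjacent events belong to the same chart and that the next event starts no earlier than the previous one ends and no more than `max_gap_days` later; the type name, function name, and parameter name are hypothetical.

```python
from datetime import date
from typing import List, NamedTuple, Sequence


class TrendEvent(NamedTuple):
    chart_id: str
    start: date
    end: date
    label: str


def join_sequences(
    step_results: Sequence[List[TrendEvent]], max_gap_days: int = 0
) -> List[List[TrendEvent]]:
    """Chain per-step results into sequences on the same chart with adjacent dates."""
    if not step_results:
        return []
    sequences = [[event] for event in step_results[0]]
    for candidates in step_results[1:]:
        extended = []
        for seq in sequences:
            last = seq[-1]
            for nxt in candidates:
                gap = (nxt.start - last.end).days
                # Join condition: same chart, and the gap is within the tunable delay.
                if nxt.chart_id == last.chart_id and 0 <= gap <= max_gap_days:
                    extended.append(seq + [nxt])
        sequences = extended
    return sequences
```

Raising `max_gap_days` above zero admits sequences whose constituent events are not strictly contiguous.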
- edge subsequences e.g., for “up, flat, down”: “up”; “up, flat”
- other in-order sub-sequences e.g., for “up, flat, down”: “flat”; “down”; “flat, down”.
- each sequence’s score is assigned to be the sum of the composite scores of its constituent segments.
- the custom scoring scheme for sequence queries is based on the following formula:
- score 0 is the un-penalized score
- l seq is the number of events in the sequence being scored
- l q is the number of events in the query
- offset seq is the number of sequential events missing from the beginning of the sequence compared to the query.
- This custom scoring scheme applies two different penalties.
- a sub-sequence of length two e.g., “up, flat”
- a sub-sequence of length one e.g., “up”
- a non-edge sub-sequence (with a large offset) will be penalized to be scored lower than an edge sub-sequence (with zero offset).
- sub-sequence “up, flat” has zero offset because it begins at the same place as the initial query pattern, but “flat, down” has an offset of one since it starts one event later in the sequence.
- subsequence partial matches that begin similarly to the desired sequence from the query should be scored higher than those that end similarly to the desired sequence.
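The quantities used by the sequence scoring scheme (sub-sequence length, query length, and offset) can be enumerated as in the sketch below. The penalty formula itself is not reproduced here; the function name and the returned tuple layout are assumptions, and sub-sequences are taken to be contiguous, consistent with the “up, flat, down” examples above.

```python
from typing import List, Tuple


def sub_sequences(query: List[str]) -> List[Tuple[Tuple[str, ...], int, bool]]:
    """Return (sub_sequence, offset_seq, is_edge) for each proper contiguous sub-sequence."""
    out = []
    n = len(query)
    for offset in range(n):
        for end in range(offset + 1, n + 1):
            if end - offset == n:
                continue  # the full query is not a partial match
            # An edge sub-sequence starts where the query starts (zero offset).
            out.append((tuple(query[offset:end]), offset, offset == 0))
    return out
```

For the query “up, flat, down”, the edge sub-sequences are “up” and “up, flat” (offset zero), while “flat” and “flat, down” have an offset of one and “down” has an offset of two.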
- Figures 18A to 18G provide a series of screenshots illustrating how the SlopeSeeker system allows a user to search for specific trends and data based on the quantified language of natural language queries, in accordance with some implementations.
- a user inputs a natural language query 1802 (e.g., “stocks that fell slowly”) into the search bar 102 (e.g., natural language input box).
- Figure 18B illustrates that, in response to receiving the natural language query, the user interface 110 displays results (e.g., as tiles 1804) corresponding to the query.
- Each of the results includes a respective chart 1806 (e.g., a line plot) showing stock prices over time.
- a respective chart 1806 can include one or more highlighted segments 1808 (e.g., visually emphasized segments, or segments that are encoded in a different color compared to the rest of the chart), corresponding to one or more specific events on the chart.
- each of the one or more specific events is an instance of falling stock prices, in which the respective segment is a segment of the chart with a negative slope.
- Figure 18C illustrates that when a user hovers (1810) the mouse over a highlighted event on a chart, the corresponding textual annotation 1812 is also highlighted.
- Figure 18D illustrates that when a user hovers (1814) the mouse over a textual annotation, the portion of the chart corresponding to the textual annotation (i.e., segment 1808-1) is visually emphasized while the rest of the events on the chart (e.g., corresponding to segment 1808-2) are visually de-emphasized.
- the user interface can display a result that has an opposite meaning (or a partial match) to the semantics of the user query.
- Figure 18B shows that the chart corresponding to the stock ticker 1805 “ALK” includes a segment 1808-6 with a steep decline for the time period July 9, 2014 to July 10, 2014 (i.e., the vertical or near-vertical line segment).
- Figure 18E shows a user modification to the query via the search bar 102.
- a user modifies the query from “stocks that fell slowly” (1802) to “stocks that fell fast” (1816).
- the user interface displays an updated set of charts of stock prices over time, in which each of the charts includes one or more respective segments corresponding to fast falling stock prices.
- the segments of the line charts in Figure 18E have slopes that are more negatively inclined (e.g., with a more negative gradient).
- the computing system can update its search to identify trends in the time series data that corresponds to the updated modifier.
- Figure 18F illustrates user input of a natural language query 1818 (e.g., “stocks that are climbing”) into the natural language input box.
- the user interface 110 displays a notification box 1422 indicating that there are no exact matches to the query 1818 (e.g., “climbing”), and displays a set of charts 1820 with highlighted portions showing instances of stock prices gradually climbing over time.
- The visually emphasized portions of the charts in Figure 18F correspond to respective time periods, identified by the computing device, in which a respective stock price experiences a relatively slow and steady increase over a longer period of time (e.g., at least 3 months, 6 months, or 9 months over a three-year timespan).
- Figure 18F shows that the chart 1820-3 corresponding to stock ticker MLM and the chart corresponding to stock ticker HUM are highlighted over the entire duration that the charts span (e.g., three years).
- Figure 18G illustrates results that are displayed in response to a natural language query 1822 “stocks that were soaring,” in accordance with some implementations.
- the instances of “soaring” correspond to segments of the line graphs with steeper slopes over shorter periods of time compared to the results in Figure 18F.
- Figures 19A to 19D are a series of screenshots illustrating building sequence queries, in accordance with some implementations.
- a sequence query refers to a series of events that are identified in a single line graph.
- a user inputs a term 1902 (e.g., a single word) “up” in the natural language input box.
- the notification box 1422 displays a notification that there is no exact match for the word “up.”
- the notification further indicates that the computing device has identified synonyms or partial matches for the word “up,” which are shown in red in the results.
- the computing device identifies words (e.g., adjectives) such as “soaring” (1904), “accelerating” (1906) and “climbing” (1908), and/or words with modifiers such as “slow soaring,” “gradual accelerating,” and “fast climbing” to be synonymous with the term “up.”
- In the example of Figure 19B, suppose that a user is interested in seeing stocks whose prices went up and then down.
- the user enters a query 1910 “up, down” into the search bar 102.
- the user interface displays events that match the query.
- Each of the events corresponds to a pair of line segments corresponding to a time period where the stock price increased and then decreased.
- Each pair of line segments consists of a first line segment 1912, corresponding to a stock price increase, and a second line segment 1914 contiguous to the first line segment, corresponding to a stock price decrease.
- the user interface displays the respective first line segments using a first color encoding (e.g., red, orange, or black) and displays the respective second line segments using a second color encoding that is distinct from the first color encoding.
- In Figure 19C, there is a user interaction (1918) (e.g., a mouse hover action) with a textual annotation (e.g., the text portion “slow soaring then slumping from December 16, 2014 to July 8, 2015”) corresponding to the chart in tile 1916.
- the portion of the line graph (segment 1912-1 and 1914-1) corresponding to the textual annotation is visually emphasized, whereas other portions of the chart (e.g., including other events identified by the computing device as matching the query “up, down”) are visually de-emphasized.
- the user builds on the previous query by inputting a term (e.g., “up”) after the previous search terms, thus creating a query 1920 with a tuple of terms “up, down, up.”
- the user interface displays a set of charts that have portions exhibiting a “price increase, price decrease, and price increase” trend.
- one of the charts identified by the computing device is a chart 1922, corresponding to stock ticker FSLR, that includes a portion corresponding to three segments (e.g., contiguous segments, or non-contiguous segments based on a tunable parameter as described in method 2300) over the time period March 2014 to June 2014, in which the stock price rose, then slumped, then sharply accelerated.
- line segments corresponding to the first stock price increase are encoded with a first color
- line segments corresponding to the stock price decrease are encoded with a second color distinct from the first color
- line segments corresponding to the second stock price increase are encoded with a third color that is distinct from both the first and second colors.
- Figures 20A and 20B are screenshots illustrating the use of the SlopeSeeker system to search for more global trends that may not correspond to a single slope within a segment of a trend, in accordance with some implementations.
- Figure 20A illustrates an example where a user inputs a query 2002 “stocks that are volatile.”
- the user interface displays charts corresponding to stock tickers FSLR and ALXN with highlighted portions corresponding to respective longer time periods (e.g., for at least one year over the three-year duration of the chart) in which the stock prices exhibit a longer period of volatility (e.g., instead of highlighting one particular slope of the chart).
- Figure 20B illustrates another example in which a user searches for “stocks that are consistent” over time (e.g., where there is very little deviation over the broad longer-term trend).
- the user interface displays two charts. Each of the charts includes a respective highlighted portion corresponding to a time period where the stock price was consistent.
- Figures 21A to 21I are a series of screenshots illustrating user interactions with the SlopeSeeker user interface, in accordance with some implementations.
- In Figure 21A, the user interface 110 receives the user input 2102 “Show me when stocks were surging.”
- Figure 21B shows that the user interface populates with result tiles 2104-1, 2104-2, 2104-3, and 2104-4, each corresponding to one stock, showing the stock price chart and textual annotations of when each event of interest occurred.
- a user can interact with the side bar 1424 to filter the results to a subset of results (e.g., to the type(s) of surging behavior that the user would like to see). For example, the user might only be interested in gradual surges (Figure 21C) or more sharp ones (Figure 21D).
- the user can further filter the results by modifying the natural language input query, such as “Show me when stocks were surging in 2016” (2106) as illustrated in Figure 21E.
- the user can further filter the results by modifying the natural language input query to name a specific stock such as “Monsanto,” as in the natural language input query 2108 “Show me when Monsanto was surging in 2016” illustrated in Figure 21F.
- a user can also search for multi-line segment shapes.
- Figure 21G shows the query 2110 “Show me when stocks fell off a cliff in 2015” that is input by a user.
- Figure 21H shows another natural query 2112 “Show me when stocks hit a trough.”
- the user interface displays results of stock prices that identify segments corresponding to when stock prices fell sharply (Figure 21G) or reached a low point (Figure 21H).
- Figure 21I illustrates results displayed on the user interface in response to a natural language query 2114 “show me when stocks went up then down, then back up.”
- the notification box 1422 informs the user that there are no exact matches for the terms “up” or “down,” but there are related concepts that are shown on the side bar.
- the results show three sequences that match the trend “up, down, up,” as well as partial sequences where they exist.
- Figures 22A to 22D provide a flowchart of a method 2200 for analyzing data trends, in accordance with some implementations.
- the method 2200 is also called a process.
- the method 2200 is performed (2202) at a computing device 200 having a display 212, one or more processors 202, and memory 206.
- the memory 206 stores (2204) one or more programs configured for execution by the one or more processors 202.
- the operations shown in Figures 1, 4, 5, 6, 7A to 7C, 8, 9, 10, 11, 12, 13, 14A, 14B, 15, 16, 17A, 17B, 18A to 18G, 19A to 19D, 20A, 20B, and 21A to 21I correspond to instructions stored in the memory 206 or other non-transitory computer-readable storage medium.
- the computer-readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
- the instructions stored on the computer-readable storage medium include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in the method 2200 may be combined and/or the order of some operations may be changed. In some implementations, some of the operations in the method 2200 may be combined with other operations in the method 2300.
- Method 2200 extends the capabilities of general search to supporting intents that involve trends and their properties in line charts.
- Existing NLI-related tools tend to focus on the general support of analytical inquiry and do not consider the interpretation of intents specific to data trends.
- Method 2200 improves user experience by introducing a process that is capable of leveraging semantics to interpret semantic concepts describing trends and their quantifiable properties. For example, method 2200 determines analytical trend intents in the search queries and identifies trends matching specified quantifiable properties such as “sharp decline” and “gradual rise” in line charts.
- the returned line charts include annotations that emphasize the most visually prominent features of the chart, which are more effective at helping users glean meaningful takeaways. As a result, user experience with data communication is improved.
- the computing device receives (2206) a first natural language input specifying one or more search terms directed to a dataset.
- the dataset comprises a set of (e.g., a plurality of) time series data.
- the computing device receives the first natural language input via a search bar 102 (e.g., a natural language input box) of a user interface 110.
- the first natural language input is a verbal input or any other user input.
- the computing device in response to receiving the first natural language input, parses (2208) the first natural language input into one or more tokens.
- the computing device assigns (2210) a respective semantic role to each of the one or more tokens.
- the respective semantic role for each token comprises (2212) a predefined category of a plurality of categories.
- the plurality of categories includes (2214) two or more of an event type (e.g., single or multisequence), a trend term (e.g., “tanking,” “plateau,” “steep increase,” or “accelerating”), an attribute (e.g., one or more data fields and/or data values of data fields), and a date range (e.g., absolute or relative date ranges).
- the plurality of categories includes (2216) the event type.
- the event type is one of a single event or a multi-sequence event.
- the computing device translates (2218) (i) the one or more tokens and (ii) one or more semantic roles assigned to the one or more tokens into one or more queries (e.g., into a machine-interpretable form that can be processed to retrieve relevant search results).
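The parsing, role-assignment, and translation steps can be sketched as below. The lexicons, the role names, the four-digit-year rule for date ranges, and the shape of the resulting query dictionary are all assumptions made for illustration; the description does not specify how the tokens are translated into queries.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lexicons; a real system would use far richer vocabularies.
TREND_TERMS = {"tanking", "plateau", "accelerating", "surging"}
ATTRIBUTES = {"monsanto"}


def assign_roles(text: str) -> List[Tuple[str, str]]:
    """Tokenize the input and keep the tokens that receive a semantic role."""
    roles = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in TREND_TERMS:
            roles.append((token, "trend_term"))
        elif token in ATTRIBUTES:
            roles.append((token, "attribute"))
        elif re.fullmatch(r"(19|20)\d{2}", token):
            roles.append((token, "date_range"))  # assumption: a bare year
    return roles


def to_query(roles: List[Tuple[str, str]]) -> Dict[str, object]:
    """Translate role-tagged tokens into a simple machine-interpretable form."""
    query: Dict[str, object] = {"trend_terms": [], "attributes": [], "date_ranges": []}
    key_for_role = {
        "trend_term": "trend_terms",
        "attribute": "attributes",
        "date_range": "date_ranges",
    }
    for token, role in roles:
        query[key_for_role[role]].append(token)
    # Assumption: more than one trend term implies a multi-sequence event.
    query["event_type"] = "single" if len(query["trend_terms"]) <= 1 else "multi_sequence"
    return query
```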
- the computing device executes (2220) the one or more queries against a search index (e.g., search index 130) to retrieve (from the search index) a plurality of labeled trend events.
- the search index is a search database, an Elasticsearch index, or other search indexing frameworks such as Solr, Sphinx, or OpenSearch.
- Each labeled trend event (i) corresponds (2222) to a respective portion (that is less than all) of a respective line chart of a set of line charts (e.g., line graphs, line plots) representing the time series data and (ii) has a respective chart identifier.
- the retrieved labeled trend events can be exact matches to at least one token (e.g., the user types "tanking" and a matching document would contain that word) or inexact matches to at least one token.
- An inexact match occurs when a search result is only returned as a result of support for synonyms or edge n-gram matches.
- a synonym match could be if the user types in "plummeting" and no documents contain that word, but documents do contain "tanking" and the specification for the search index settings has specified that "plummeting" and "tanking" are synonyms.
- An edge n-gram match occurs when the user only partially types in a search term but the search index can guess what the user means based on these first few letters. For instance, if the user types "dro", the search index would return documents that contain "dropping".
- the synonym and edge n-gram processing are performed by the search index according to specification in the search index settings (see, e.g., Figure 15).
- each labeled trend event of the plurality of labeled trend events is (2224) identified by a respective chart ID 262-1, a respective start point (in time) 264-1, a respective end point (in time) 266-1, and a respective set of (one or more) semantic labels 268-1.
- a labeled trend event includes one (i.e., a single) semantic label.
- a labeled trend event includes two or more semantic labels. For example, if a trend event corresponds to a steep fall in stock price, it can be labeled ["tanking", "falling", "slumping"], ordered from most to least precise.
- each labeled trend event of the plurality of labeled trend events is (2226) a respective labeled slope segment of a respective line chart in the set of line charts.
- each line chart in the set of time series line charts is (2228) a plot of data values of a data field (or changes in data values of a data field) over a predefined timespan.
- each line chart in the set of line charts has (2230) the same time span.
- each line chart in the set of line charts can have the same length of time, such as 6 months, 1 year, or 3 years.
- each line chart in the set of line charts can span the same time duration, such as from January 2021 to December 2023.
- the retrieved plurality of labeled trend events includes (2234) a first labeled trend event corresponding to an exact match of the one or more tokens.
- a user types “tanking” and the first labeled trend event contains the word “tanking.”
- the retrieved plurality of labeled trend events includes (2236) a second labeled trend event corresponding to an inexact match of the one or more tokens. An inexact match occurs when a search result is only returned as a result of support for synonyms or edge n-gram matches.
- a synonym match occurs when a user inputs the term "plummeting" and no documents contain that word, but documents do contain "tanking" and the specification for the search index settings has specified that "plummeting" and "tanking" are synonyms.
- An edge n-gram match occurs when the user only partially types in a search term and the search index can guess what the user means based on these letters. For instance, if the user types "dro", the search index will return documents that contain "dropping".
- the computing device in accordance with a determination that no exact match exists between the retrieved plurality of labeled trend events and the one or more tokens, generates (2238) and displays a notification (e.g., in a notification box 1422) indicating that there is no exact match for the one or more terms.
- the computing device displays (e.g., in a side bar 1424 of the user interface 110), one or more user-selectable text labels corresponding to synonyms of the one or more terms.
- the computing device determines (2240), for each labeled trend event, a respective composite score.
- the computing device individually assigns (2242) each of the plurality of labeled trend events to a respective group according to the respective chart identifier (see, e.g., Bucketing process in Section V.D.3).
- Each group (i) includes one or more respective labeled trend events and (ii) corresponds to one respective line chart in the set of line charts.
- Each group includes at least one labeled trend event.
- the number of groups is less than or equal to the number of line charts in the set of line charts.
- the computing device sorts (2244), for each group of the one or more groups, the one or more respective labeled trend events within the respective group according to respective composite scores corresponding to the one or more respective labeled trend events.
- the respective composite score for each labeled trend event is computed (2246) (e.g., by the search index 130 or by the computing device 200) based on (1) a respective label score representing an extent to which the one or more search terms match respective labels of the plurality of labeled trend events and (2) a respective visual saliency score (the visual saliency score quantifies the perceptual prominence of a trend event).
- the respective composite score is (2248) a product (i.e., multiplication) of the respective label score and the respective visual saliency score.
- the respective label score is computed (2250) according to (i) a frequency with which the search terms occur in the respective labeled trend event and (ii) a label length of the respective labeled trend event.
- determining (2252), for each labeled trend event, the respective composite score includes computing the respective visual saliency score according to (i) a temporal duration of the respective portion of the respective line chart relative to the predefined timespan and (ii) a first difference in the data values of the data field over the temporal duration relative to a second difference in the data values of the data field over the predefined timespan.
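The two ratios that feed the visual saliency score can be sketched as follows. How the ratios are combined is not stated in this passage, so the product used below is an assumption, as are the function and parameter names and the use of absolute differences.

```python
def visual_saliency(
    duration_days: float,
    timespan_days: float,
    value_change: float,
    total_change: float,
) -> float:
    """Combine relative duration and relative value change (assumed: as a product)."""
    if timespan_days <= 0 or total_change == 0:
        raise ValueError("timespan and total change must be non-zero")
    duration_ratio = duration_days / timespan_days       # (i) relative duration
    change_ratio = abs(value_change) / abs(total_change)  # (ii) relative difference
    return duration_ratio * change_ratio
```

Under this assumption, a three-month event that accounts for half of a chart's overall change scores well above a three-day event with a small change, even if both share the same slope label.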
- the computing device determines (2254), for each group (e.g., bucket) of the one or more groups, a respective final score.
- the final score for each group is the sum of the composite scores of its individual events, and buckets are presented in sorted order according to this final score. Groups (or buckets) are only created if they have labeled trend events that would fit into them. In other words, there are no empty buckets, and so each bucket inherently has a score greater than zero.
- the computing device ranks (2256) the one or more groups according to one or more determined final scores.
- the computing device retrieves (2258), from the dataset, data corresponding to a first subset of (e.g., one or more) line charts having the respective chart identifiers of the ranked groups in accordance with the ranking.
- the computing device, after retrieving the data corresponding to the first subset of line charts, generates (2260), for each line chart in the first subset of line charts, a respective text description describing a predefined number (e.g., up to three, five, or seven) of events that match the one or more search terms, including annotating (e.g., color-encoding or label-encoding) respective words in the respective text description that match (e.g., partially or fully match) the one or more search terms.
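Generating an annotated text description can be sketched as below. Square brackets stand in for the color-encoding or label-encoding of matching words; the event tuple layout, the separator, and the default limit of three events are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def describe(
    events: List[Tuple[str, str, str]], search_terms: Iterable[str], limit: int = 3
) -> str:
    """Describe up to `limit` events, marking label words that match a search term."""
    terms = {t.lower() for t in search_terms}
    parts = []
    for label, start, end in events[:limit]:
        # Brackets stand in for the visual annotation of a matching word.
        words = [f"[{w}]" if w.lower() in terms else w for w in label.split()]
        parts.append(f"{' '.join(words)} from {start} to {end}")
    return "; ".join(parts)
```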
- the computing device generates (2262) the first subset of line charts. For example, the computing device individually generates each line chart of the first subset of line charts.
- the computing device annotates (2264) (e.g., via color-encoding or label-encoding) respective segments of the first subset of line charts that correspond to the labeled trend events.
- the computing device displays (2266) one or more line charts of the first subset of line charts as annotated.
- displaying the one or more of the first subset of line charts as annotated includes (2268) displaying the respective text snippet with each line chart in the one or more line charts.
- the computing device displays (2270) the annotated respective words with a different visual characteristic from other words in the respective text description.
- the computing device displays (2272) the annotated respective segments with a different visual characteristic from other segments of the one or more line charts.
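- The word-annotation step of method 2200 can be illustrated with a short sketch. It assumes a label-encoding in which matched words are wrapped in square brackets, and it handles only full-word matches (the description also allows partial matches). The function name `annotate_description` and the bracket convention are hypothetical.

```python
import re


def annotate_description(description, search_terms):
    """Wrap words of `description` that match a search-term word in brackets."""
    # Multi-word search terms are split into individual words.
    words = sorted({w for term in search_terms for w in term.lower().split()})
    if not words:
        return description
    # \b limits matches to whole words; matching is case-insensitive.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(r"[\1]", description)


text = annotate_description(
    "Rapidly falling from March to May, then flat.", ["falling", "flat"]
)
# text == "Rapidly [falling] from March to May, then [flat]."
```

- A renderer could then map the bracketed words to a distinct color or weight so that they differ visually from the other words in the description.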
- Figures 23A to 23E provide a flowchart of a method 2300 for analyzing data trends, in accordance with some implementations.
- the method 2300 is also called a process.
- the method 2300 is performed (2302) at a computing device 200 having a display 212, one or more processors 202, and memory 206.
- the memory 206 stores (2304) one or more programs configured for execution by the one or more processors 202.
- the operations shown in Figures 1, 4, 5, 6, 7A to 7C, 8, 9, 10, 11, 12, 13, 14A, 14B, 15, 16, 17A, 17B, 18A to 18G, 19A to 19D, 20A, 20B, and 21A to 21I correspond to instructions stored in the memory 206 or other non-transitory computer-readable storage medium.
- the computer-readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
- the instructions stored on the computer-readable storage medium include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in the method 2300 may be combined and/or the order of some operations may be changed. In some implementations, some of the operations in the method 2300 may be combined with other operations in the method 2200.
- Method 2300 relates to receiving a sequence query that includes a set of trend events (single-word or multi-word) in a specified order.
- Method 2300 improves the user experience by enabling a user to search for a sequence of trend events (e.g., “rapidly falling” followed by “plateauing”). Because this type of query is not straightforward to support natively in a search index, method 2300 constructs sequence query results by executing, for each individual constituent event, a respective set of queries to retrieve a respective set of labeled trend events. Sequences are then constructed by joining these results (e.g., via a SQL join) on chart identifier and start/end dates, with a tunable parameter to allow for some temporal delay between adjacent events. As a result, users can access time series data with multiple trends in a specified order, which enhances the user experience.
- the computing device receives (2306) (e.g., via a search bar 102 or a natural language input box of a user interface, or via a verbal input, or any other user input) a natural language input specifying a plurality of search terms directed to a dataset.
- the plurality of search terms includes a first search term and a second search term.
- the second search term is subsequent to the first search term in the natural language input.
- the dataset comprises a set of (e.g., a plurality of) time series data.
- the natural language input does not have another trend term between the first search term and the second search term.
- using the natural language input “Show me when Acme stocks went up, then flat, then down before 2020” as an example, in some instances, the first term is “up” and the second term is “flat.” In some instances, the first term is “flat” and the second term is “down.”
- the parser 236 (or the parser module 336) identifies the trend terms as [‘up’], [‘flat’], and [‘down’].
- the parser determines that the natural language input is a multi-sequence event type based on the number of trend terms. In the case of a multi-sequence event type, the number of trend terms in the natural language input is greater than one. In the case of a single event type, the number of trend terms in the natural language input is exactly equal to one.
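- The event-type determination above can be sketched as a count over extracted trend terms. This is an illustrative simplification: the three-word vocabulary `TREND_TERMS`, the regular-expression tokenizer, and the function names are assumptions, and a semantic parser such as parser 236 would do considerably more.

```python
import re

# Hypothetical trend vocabulary; a real parser would use a richer lexicon.
TREND_TERMS = {"up", "down", "flat"}


def extract_trend_terms(text):
    """Return the trend terms in the order they appear in the input."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok in TREND_TERMS]


def event_type(trend_terms):
    """Classify the input by its number of trend terms."""
    if len(trend_terms) > 1:
        return "multi-sequence"
    if len(trend_terms) == 1:
        return "single"
    return "none"  # no trend term found; this case is outside the source text


terms = extract_trend_terms(
    "Show me when Acme stocks went up, then flat, then down before 2020"
)
kind = event_type(terms)
# terms == ["up", "flat", "down"], kind == "multi-sequence"
```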
- the plurality of search terms specified in the natural language input includes (2308) a third search term.
- in response to (2310) receiving the natural language input, and in accordance with a determination that the first search term and the second search term (e.g., where each of the first and second search terms can be single-word or multi-word) specify a first sequence of data trends (e.g., in a specific order), the computing device (i) executes, for the first search term, one or more first queries against a search index to retrieve a first set of (one or more) labeled trend events; and (ii) executes, for the second search term, one or more second queries against the search index to retrieve a second set of (one or more) labeled trend events.
- Each labeled trend event in the first and second sets of labeled trend events (i) corresponds (2312) to a respective portion (that is less than all) of a respective line chart of a set of line charts (e.g., line graphs, line plots) representing the time series data and (ii) has a respective chart identifier.
- each labeled trend event of the plurality of labeled trend events is (2316) a respective labeled slope segment of a respective line chart in the set of line charts.
- each line chart in the set of line charts has (2318) the same time span.
- each line chart in the set of line charts can have the same length of time, such as 6 months, 1 year, or 3 years.
- each line chart in the set of line charts can have the same time duration, such as from January 2021 to December 2023.
- the first sequence of data trends is (2322) specified by the first search term, the second search term, and the third search term.
- the third search term is subsequent to the second search term. In some instances, the third search term precedes the first search term. Going back to the example natural language input “Show me when Acme stocks went up, then flat, then down before 2020,” in one example, the first search term is “up,” the second search term is “flat,” and the third search term is “down.” In another example, the third search term is “up,” the first search term is “flat,” and the second search term is “down.”
- the determination that the first search term and the second search term specify the first sequence of data trends includes parsing (2324) (e.g., using a semantic parser, such as parser 236 or parser module 336) the natural language input including the first search term and the second search term into a plurality of tokens, including assigning (i) a first semantic role to a first token corresponding to the first search term and (ii) a second semantic role to a second token corresponding to the second search term; and determining, based on the assigned first and second semantic roles, that the first search term and the second search term specify the first sequence of data trends.
- parsing the natural language input includes determining (2326) that an event type corresponding to the natural language input is a multi-sequence event type.
- the computing device constructs (2328) one or more sequences of labeled trend events based on the retrieved first and second sets of labeled trend events.
- constructing the one or more sequences of labeled trend events based on the retrieved first and second sets of trend events includes: for each sequence of labeled trend events, joining (2330) (e.g., via a SQL join) a respective first labeled trend event corresponding to the first search term and a respective second labeled trend event corresponding to the second search term, according to (i) a respective chart identifier corresponding to the respective first labeled trend event and the respective second labeled trend event and (ii) respective start and end dates of the respective first labeled trend event and the respective second labeled trend event.
- a tunable parameter is included to allow for some temporal delay (or a temporal interlude) between adjacent events.
- the tunable parameter allows for "a temporal interlude between adjacent events in a sequence," the length of which is chosen as appropriate for the data and context.
- the “tunable parameter” can be set to one day, three days, one week, or any amount of time a user would like to allow between events for them to still be considered a sequence. This means that a sequence search for “up, down” would return sequences composed of an “up” event followed by a “down” event with an allowance for an intervening day, week, etc. between the end of the “up” event and the start of the “down” event.
- the temporal delay can be an arbitrarily long or short temporal delay. In practice, the amount of temporal delay can be selected based on what is appropriate for the data and domain.
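- The join-based sequence construction, including the tunable gap between adjacent events, can be sketched with an in-memory SQLite database. The table schema, the column names, the function `find_sequences`, and the sample rows are all hypothetical. The sketch also assumes that the second event starts on or after the end of the first, so overlapping events are not paired.

```python
import sqlite3


def find_sequences(events, first_label, second_label, max_gap_days):
    """Pair events of two labels on the same chart, in order.

    `events` rows are (chart_id, label, start_date, end_date) with ISO
    dates; this schema is an assumption made for the sketch.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE events "
        "(chart_id TEXT, label TEXT, start_date TEXT, end_date TEXT)"
    )
    con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
    rows = con.execute(
        """
        SELECT a.chart_id, a.start_date, a.end_date, b.start_date, b.end_date
        FROM events AS a
        JOIN events AS b
          ON a.chart_id = b.chart_id
         AND julianday(b.start_date) - julianday(a.end_date) BETWEEN 0 AND ?
        WHERE a.label = ? AND b.label = ?
        ORDER BY a.chart_id, a.start_date
        """,
        (max_gap_days, first_label, second_label),
    ).fetchall()
    con.close()
    return rows


events = [
    ("acme", "up", "2021-01-01", "2021-03-01"),
    ("acme", "down", "2021-03-04", "2021-05-01"),    # starts 3 days after "up" ends
    ("globex", "up", "2021-01-01", "2021-02-01"),
    ("globex", "down", "2021-04-01", "2021-06-01"),  # starts 59 days after "up" ends
]
within_week = find_sequences(events, "up", "down", 7)
within_quarter = find_sequences(events, "up", "down", 90)
```

- With a seven-day allowance only the `acme` pair qualifies; widening the allowance to 90 days also admits the `globex` pair, which shows how the parameter trades strict adjacency for recall.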
- the computing device in accordance with a determination that the constructed one or more sequences of labeled trend events are partial sequence matches of the natural language input, determines (2332), for each sequence of the one or more sequences, a respective sequence score based at least in part on (i) a number of events in the respective sequence and (ii) a respective sequence offset (the respective sequence offset is the number of sequential events missing from the beginning of the sequence compared to the query).
- the sub-sequence “up, flat” has zero offset because it begins at the same place as the initial query pattern, but “flat, down” has an offset of one since it starts one event later in the sequence.
- sub-sequence partial matches that begin similarly to the desired sequence from the query should be scored higher than those that end similarly to the desired sequence.
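- The source states that the partial-match score depends on the number of events and the sequence offset but does not give a formula. The sketch below uses one hypothetical formula, coverage divided by one plus the offset, chosen only because it ranks matches that begin like the query above matches that end like it. The function names are likewise assumptions.

```python
def sequence_offset(query, sub):
    """Number of query events missing before `sub` begins.

    Returns None if `sub` is not a contiguous sub-sequence of `query`.
    """
    n = len(sub)
    for offset in range(len(query) - n + 1):
        if query[offset:offset + n] == sub:
            return offset
    return None


def partial_match_score(query, sub):
    """Hypothetical score: fraction of the query covered, damped by offset."""
    offset = sequence_offset(query, sub)
    if offset is None or not sub:
        return 0.0
    coverage = len(sub) / len(query)
    return coverage / (1 + offset)


query = ["up", "flat", "down"]
head = partial_match_score(query, ["up", "flat"])    # offset 0
tail = partial_match_score(query, ["flat", "down"])  # offset 1
```

- Under this formula a full match scores 1.0, the head match “up, flat” scores two thirds, and the tail match “flat, down” scores one third.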
- the computing device determines (2334), for each sequence of the one or more sequences, a respective sequence score by aggregating (summing) one or more respective composite scores corresponding to one or more respective labeled trend events in the respective sequence.
- the computing device determines (2336) a respective composite score for the respective labeled trend event based on (1) a respective label score representing an extent to which a respective search term matches respective labels of the plurality of labeled trend events and (2) a respective visual saliency score.
- the visual saliency score quantifies the perceptual prominence of a trend event.
- the respective composite score is (2338) a product (e.g., a multiplication) of the respective label score and the respective visual saliency score.
- determining the respective composite score includes computing (2340) the respective label score according to (i) a frequency with which the first search terms occur in the respective labeled trend event and (ii) a label length of the respective labeled trend event.
- each line chart in the set of time series line charts is a plot of data values of a data field (or changes in data values of a data field) over a predefined timespan.
- Determining the respective composite score includes computing (2342) the respective visual saliency score according to (i) a temporal duration of the respective portion of the respective line chart relative to the predefined timespan and (ii) a first difference in the data values of the data field over the temporal duration relative to a second difference in the data values of the data field over the predefined timespan.
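- The composite score can be sketched as follows. The source confirms only that the composite score is the product of the label score and the visual saliency score. Everything else here is an assumption: the label score is taken as the fraction of label words that match the search term, the saliency score as the product of the duration ratio and the value-change ratio, and all function and parameter names are hypothetical.

```python
def label_score(search_term, label):
    """Fraction of label words that match a word of the search term."""
    label_tokens = label.lower().split()
    query_tokens = set(search_term.lower().split())
    if not label_tokens:
        return 0.0
    hits = sum(1 for tok in label_tokens if tok in query_tokens)
    # Dividing by label length favors short, specific labels.
    return hits / len(label_tokens)


def visual_saliency(event_days, chart_days, event_delta, chart_delta):
    """Duration ratio times value-change ratio (an assumed combination)."""
    if chart_days <= 0 or chart_delta == 0:
        return 0.0
    duration_ratio = event_days / chart_days
    change_ratio = abs(event_delta) / abs(chart_delta)
    return duration_ratio * change_ratio


def composite_score(search_term, label, event_days, chart_days,
                    event_delta, chart_delta):
    """Product of label score and visual saliency score."""
    return label_score(search_term, label) * visual_saliency(
        event_days, chart_days, event_delta, chart_delta
    )


score = composite_score("falling", "rapidly falling", 30, 120, -20.0, 80.0)
# label score 0.5, saliency 0.25 * 0.25 = 0.0625, composite 0.03125
```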
- the computing device assigns (2344) each sequence of labeled trend events, of the one or more sequences of labeled trend events, into one or more groups according to the respective chart identifier.
- the computing device determines (2346), for each group of the one or more groups, a respective final score.
- the respective final score for each group of the one or more groups is (2348) an aggregation of one or more respective sequence scores, from one or more respective sequences of labeled trend events, in the respective group.
- the computing device ranks (2350) the one or more groups according to one or more determined final scores.
- the computing device retrieves (2352), from the dataset, data corresponding to a subset of (one or more) line charts having the respective chart identifiers of the ranked groups in accordance with the ranking.
- the computing device generates (2354) the subset of line charts.
- the computing device annotates (2356) (e.g., via color-encoding or label-encoding) respective segments of the subset of line charts that correspond to the sequences of labeled trend events.
- the computing device displays (2358) one or more line charts of the subset of line charts as annotated.
- the methods disclosed herein comprise one or more steps or actions for achieving the described method.
- the method steps and/or actions may be interchanged with one another without departing from the scope of the claims.
- the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
- the term “plurality” denotes two or more. For example, a plurality of components indicates two or more components.
- the term “determining” encompasses a wide variety of actions and, therefore, “determining” can include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” can include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, “determining” can include resolving, selecting, choosing, establishing and the like.
- the phrase “based on” does not mean “based only on,” unless expressly specified otherwise. In other words, the phrase “based on” describes both “based only on” and “based at least on.”
- the term “exemplary” means “serving as an example, instance, or illustration,” and does not necessarily indicate any preference or superiority of the example over any other configurations or embodiments.
- the term “and/or” encompasses any combination of listed elements.
- “A, B, and/or C” entails each of the following possibilities: A only, B only, C only, A and B without C, A and C without B, B and C without A, and a combination of A, B, and C.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (8)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363463055P | 2023-04-30 | 2023-04-30 | |
| US63/463,055 | 2023-04-30 | ||
| US202363543070P | 2023-10-07 | 2023-10-07 | |
| US63/543,070 | 2023-10-07 | ||
| US18/426,186 US12216678B2 (en) | 2023-04-30 | 2024-01-29 | Search tool for exploring quantifiable trends in line charts |
| US18/426,186 | 2024-01-29 | ||
| US18/426,192 | 2024-01-29 | ||
| US18/426,192 US12511307B2 (en) | 2023-04-30 | 2024-01-29 | Systems and methods for exploring quantifiable trends in line charts |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024229036A1 (en) | 2024-11-07 |
Family
ID=91616675
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/027076 Pending WO2024229036A1 (en) | 2023-04-30 | 2024-04-30 | Systems and methods for exploring quantifiable trends in line charts |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024229036A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10867256B2 (en) * | 2015-07-17 | 2020-12-15 | Knoema Corporation | Method and system to provide related data |
| US20220318261A1 (en) * | 2021-03-30 | 2022-10-06 | Tableau Software, LLC | Implementing a Visual Analytics Intent Language Across Multiple Devices |
| US11604800B1 (en) * | 2021-07-28 | 2023-03-14 | International Business Machines Corporation | Generating a visualization of data points returned in response to a query based on attributes of a display device and display screen to render the visualization |
-
2024
- 2024-04-30 WO PCT/US2024/027076 patent/WO2024229036A1/en active Pending
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US12007988B2 (en) | Interactive assistance for executing natural language queries to data sets | |
| US8935249B2 (en) | Visualization of concepts within a collection of information | |
| US8051073B2 (en) | System and method for measuring the quality of document sets | |
| US20180082183A1 (en) | Machine learning-based relationship association and related discovery and search engines | |
| US12182514B2 (en) | Automatic synonyms using word embedding and word similarity models | |
| CA3077454C (en) | Methods, systems, and computer-readable media for semantically enriching content and for semantic navigation | |
| US20240338378A1 (en) | Semantic Search Interface for Data Repositories | |
| SanJuan et al. | A symbolic approach to automatic multiword term structuring | |
| Rajman et al. | From text to knowledge: Document processing and visualization: A text mining approach | |
| US20250252124A1 (en) | System and method for integrating artificial intelligence assistants with website building systems | |
| US12216678B2 (en) | Search tool for exploring quantifiable trends in line charts | |
| Hamroun et al. | Efficient text-based query based on multi-level and deep-semantic multimedia indexing and retrieval | |
| US20200293574A1 (en) | Audio Search User Interface | |
| WO2024229036A1 (en) | Systems and methods for exploring quantifiable trends in line charts | |
| US20250117423A1 (en) | Systems and methods for supporting sketch-based querying for data trend analysis | |
| CN121464434A (en) | System and method for exploring quantifiable trends in line graphs | |
| Spahiu et al. | Understanding the structure of knowledge graphs with ABSTAT profiles | |
| Cheng et al. | Topexplorer: Tool support for extracting and visualizing topic models in bioengineering text corpora | |
| Urbain et al. | A semantic and content-based search user interface for browsing large collections of Foley sounds | |
| WO2024211835A1 (en) | Semantic search interface for data repositories | |
| Bernasconi et al. | Linked Data Interfaces: A Survey. Information 2023, 14, 483 | |
| Agi et al. | An Enhanced Model for the Classification of Mined Data | |
| Li | Computational approach for identifying and visualizing innovation in patent networks | |
| Rajman¹ et al. | From Text to Knowledge: Document Processing and | |
| Omizo et al. | Digging text viz: an archaeological review of ACM digital library text visualizations publications (1991--2003) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24735038; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2024735038; Country of ref document: EP; Effective date: 20251201 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024735038; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |