WO2019147968A1

WO2019147968A1 - Real time multi variate time series search

Info

Publication number: WO2019147968A1
Application number: PCT/US2019/015197
Authority: WO
Inventors: Mahadevan Balasubramaniam; Arun Karthi SUBRAMANIYAN
Original assignee: GE Inspection Technologies LP
Current assignee: Waygate Technologies USA LP
Priority date: 2018-01-26
Filing date: 2019-01-25
Publication date: 2019-08-01
Anticipated expiration: 2020-07-26
Also published as: CN111989661A; RU2020127289A; RU2020127289A3; SG11202007063PA; EP3743825A4; EP3743825A1

Abstract

A string representation of a query signal of a data set storing a string representation of a stream of time series data is received. The time series data is generated by a machine asset. An indexed data set is searched for an occurrence of the query signal by at least determining a distance between the query signal and portions of the indexed data set. The occurrence of the query signal within the indexed data set is provided. Related apparatus, systems, techniques and articles are also described.

Description

Real Time Multi Variate Time Series Search

CROSS-REFERENCE TO RELATED APPLICATION

[0001] The present application claims priority to U.S. Provisional Application No. 62/622,739, filed on January 26, 2018 in the U.S. Patent and Trademark Office, the entire disclosure of which is hereby incorporated by reference.

BACKGROUND

[0002] In any industrial asset, there may be large quantities of data being acquired during operation, for example, from sensors and/or operational parameters. In some cases, up to about 97% of the acquired data can go unused due to lack of tools that can utilize the data for troubleshooting. As an example, troubleshooting can focus on specific installations where a problem has occurred around a failure time period.

[0003] In order to search time series data produced by industrial assets for patterns, a search can be performed by extracting the data from a database and then operating on it to search for patterns. Such searches are time consuming and require significant processing power.

SUMMARY

[0004] The subject matter described herein relates to real time multi variate time series search. In an aspect, a string representation of a query signal of a data set storing a string representation of a stream of time series data is received. The time series data is generated by a machine asset. An indexed data set is searched for an occurrence of the query signal by at least determining a distance between the query signal and portions of the indexed data set. The occurrence of the query signal within the indexed data set is provided. Related apparatus, systems, techniques and articles are also described.

[0005] Non-transitory computer program products (e.g., physically embodied computer program products) are also described that store instructions, which when executed by one or more data processors of one or more computing systems, cause at least one data processor to perform operations herein. Similarly, computer systems are also described that may include one or more data processors and memory coupled to the one or more data processors. The memory may temporarily or permanently store instructions that cause at least one processor to perform one or more of the operations described herein. In addition, methods can be implemented by one or more data processors either within a single computing system or distributed among two or more computing systems. Such computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including a connection over a network (e.g. the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.

[0006] The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

[0007] FIG. 1 is a process flow diagram illustrating an example process of searching an indexed data set with a query signal (e.g., pattern) represented as a string, which is a compressed representation of the time series data.

[0008] FIG. 2 is a plot showing example time series data and the corresponding example string representation of that time series data.

[0009] FIG. 3 illustrates another example of time series data represented as a string.

[0010] FIG. 4 is a plot illustrating the distance between two signals quantized to levels corresponding to their string representation.

[0011] FIG. 5 illustrates the respective string representations of the time series and the query signal.

[0012] FIG. 6 is a functional block diagram illustrating an example framework for a time series pattern search system.

[0013] FIG. 7 shows two plots illustrating normalized reference data.

[0014] FIG. 8 is a plot illustrating an example L2 norm distance for an example query.

[0015] FIG. 9 illustrates example time series data with highlighted portions indicating that a query signal was found within the time series data.

[0016] Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0017] The current subject matter can enable generic time series pattern searches on datasets, in some instances, massive datasets. In some implementations, time series data relating to industrial machines (such as oil and gas wells, turbines, refineries, and the like) can be indexed as it is stored. As an example, the time series signal can be compressed to or represented as a string representation. Patterns can be detected by searching through the dataset using the string representation as an index. In some implementations, the system interface can allow a user to select any portion of a multi-variate time series signal and to quickly search through very large datasets to identify similar patterns. Some implementations can be orders of magnitude faster than currently available industry standard practices for searching through time series datasets. By having a rapid way of querying similar patterns in the time series, a huge wealth of information about the onset of industrial machine failures or troubling operation prior to it becoming severe can be provided.

[0018] FIG. 1 is an example process flow diagram 100 illustrating a process for searching an indexed data set with a query signal (e.g., pattern) represented as a string, which is a compressed representation of the time series data. Because the process illustrated in FIG. 1 can be implemented as a rapid way of querying similar patterns in a time series, information about the onset of industrial machine failures or troubling operation prior to it becoming severe can be provided.

[0019] At 110, data characterizing a string representation of a query signal is received. The query signal can be of a data set that is storing a string representation of time series data generated by a machine asset such as, for example, an industrial machine asset. For example, if the time series data is of rotations per minute of a turbine that varies over a range of values that are represented as, e.g., float or double data types, those values can be transformed (e.g., encoded) into string values taken from a fixed set of string values.

[0020] FIG. 2 is a plot 200 showing example time series data and the corresponding example string representation of that time series data. By transforming the time series data to a string representation, the amplitude values of the time series data are effectively quantized to discrete levels corresponding to one of a number of string values. In the example of FIG. 2, the amplitude of the time series data is quantized to one of three values (“a”, “b”, or “c”).

[0021] FIG. 3 is a plot 300 illustrating another example of time series data represented as a string. The time series data (amplitude value on left axis) can be assigned one value from the set of “a”, “b”, “c”, “d”, “e”, and “f” (the assignment scheme is illustrated on the right axis). This approach can be advantageous as each symbol can require fewer bits than real numbers (e.g., float, double, and the like). In addition, in some implementations, nearly 100-fold compression can be possible without substantial loss in fidelity and the approach can be noise immune. This can be achieved because in some implementations, the time series data is not pre-filtered to suppress certain features or characteristics of the data.
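The quantization described for FIGS. 2 and 3 can be sketched as follows. This is a minimal illustration only: the function name `to_string`, the equal-width bins, and the sample values are assumptions made here, since the text does not state how level boundaries are chosen.

```python
import string


def to_string(values, levels=6, lo=None, hi=None):
    """Quantize numeric samples to letters 'a', 'b', ... (assumed equal-width bins)."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    alphabet = string.ascii_lowercase[:levels]
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    symbols = []
    for v in values:
        # Map v to a bin index, then clamp it into [0, levels - 1].
        idx = int((v - lo) * levels / span)
        symbols.append(alphabet[max(0, min(levels - 1, idx))])
    return "".join(symbols)


# Hypothetical amplitude samples on a 0..6 scale, encoded with six levels.
samples = [0.0, 1.0, 2.5, 3.0, 4.5, 6.0]
print(to_string(samples, levels=6, lo=0.0, hi=6.0))
```

Passing fixed `lo` and `hi` bounds, rather than deriving them from each batch, keeps the symbol assignment stable across data received at different times.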

[0022] Referring again to FIG. 1, at 120, the indexed data set is searched for an occurrence of the query signal. As an example, the indexed data set can be searched by at least determining a distance between the query signal and portions of the indexed data set. Distance may be computed as a distance between the string characters in the query signal and the indexed data set using a sliding window approach. The distance can include the Euclidean distance or the piecewise aggregate approximation (PAA) distance, which can be used when the time series is discretized and aggregated. Other measures of distance can be used. For example, FIG. 4 is a plot 400 illustrating the distance between two signals quantized to levels corresponding to their string representation and FIG. 5 is a plot 500 illustrating the respective string representations of the time series and the query signal. The Euclidean distance between “a” and “b” is 1, whereas the Euclidean distance between “a” and “d” is 3. For the representation of FIG. 4, this can be expressed formally as

where Q is the query signal, and C is the indexed data set. Where the signals are represented as strings, the Euclidean distance can be expressed formally as

where dist() returns the integer separation between two strings. In some implementations, determining the distance between two strings can be implemented with a table lookup in which the distance between each pair of strings or characters is stored. A table lookup is a constant-time, O(1), evaluation, which can be performed quickly.
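The two expressions referred to in paragraph [0022] do not appear in this text. As a hedged reconstruction only, assuming the standard Euclidean form, they could be written as shown here. The symbols $q_i$ and $c_i$ (the i-th samples of Q and C), $\hat{q}_i$ and $\hat{c}_i$ (the i-th string characters), and the window length $n$ are notation introduced for this sketch and are not taken from the original.

```latex
D(Q, C) = \sqrt{\sum_{i=1}^{n} \left( q_i - c_i \right)^{2}}
\qquad
D(\hat{Q}, \hat{C}) = \sqrt{\sum_{i=1}^{n} \operatorname{dist}\!\left( \hat{q}_i, \hat{c}_i \right)^{2}}
```

For single characters this form gives dist(“a”, “b”) = 1 and dist(“a”, “d”) = 3, which agrees with the values stated in paragraph [0022]. Any scaling factor that may accompany the PAA distance is not shown.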

[0023] Referring again to FIG. 1, at 130, the occurrence of the query signal within the indexed data set can be provided. The occurrence can be represented as, e.g., a time index within the data set at which the pattern occurs. The occurrence can include the time indices associated with the minimum computed distance. In some implementations, multiple occurrences are provided, for example the indices associated with all computed distances that are below a predetermined value or the smallest N distances, where N is predetermined.
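The search and reporting steps at 120 and 130 can be sketched as below. This is an illustrative sketch under stated assumptions, not the described implementation: the six-letter alphabet, all function names, and the sample strings are hypothetical, and the symbol distance is taken to be the integer separation between letters held in a lookup table.

```python
import math
import string

ALPHABET = string.ascii_lowercase[:6]
# Precomputed table: integer separation between every pair of symbols.
DIST = {(x, y): abs(i - j)
        for i, x in enumerate(ALPHABET)
        for j, y in enumerate(ALPHABET)}


def window_distance(query, window):
    """Euclidean distance between two equal-length symbol strings."""
    return math.sqrt(sum(DIST[q, w] ** 2 for q, w in zip(query, window)))


def search(index, query, stride=1):
    """Return (position, distance) for each sliding-window position."""
    n = len(query)
    return [(pos, window_distance(query, index[pos:pos + n]))
            for pos in range(0, len(index) - n + 1, stride)]


def best_match(results):
    """Position and distance of the minimum computed distance."""
    return min(results, key=lambda r: r[1])


def below_threshold(results, threshold):
    """All occurrences whose distance is below a predetermined value."""
    return [r for r in results if r[1] < threshold]


def smallest_n(results, n):
    """The N occurrences with the smallest distances."""
    return sorted(results, key=lambda r: r[1])[:n]


index = "aabcdcbaabcf"   # hypothetical indexed data set
query = "bcd"            # hypothetical query string
results = search(index, query)
print(best_match(results))
print(below_threshold(results, 2.5))
```

The returned positions are offsets into the string index; mapping them back to timestamps depends on how the index is stored alongside the raw data.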

[0024] Providing the occurrence can include displaying the time series data in a manner that highlights the portion of the time series that resulted in a match. For example, FIG. 9 illustrates example time series data with highlighted portions indicating that a query signal was found within the time series data.

[0025] The provided occurrence characterizes the presence or matching of a pattern within the data set, which can represent the occurrence of an event. This can be used, for example, where there are many assets in the field producing data over multiple years. Once a match of a pattern in a given asset is determined, it can be determined whether other assets previously had the same event and, if so, which assets and when. How those identified assets deteriorated over time can also be determined, as well as the consequence of the occurrence of the event. This analysis and learning can be used to take corrective action to prevent the current asset from having similar problems. In other words, the current subject matter can enable industrial machine operators to identify potential operational problems early and take appropriate action before further damage or performance loss occurs.

[0026] FIG. 6 is a functional block diagram 600 illustrating an example framework for a time series pattern search system. A query signal can be received in multi-variate form and transformed to a string representation, which is an approximation of the multi-variate query signal. In addition, incoming data is received (e.g., from the industrial machine) and transformed to a string representation, and both the incoming data and the string representation are stored in a database as an indexed data set. To perform the search, the query signal approximation can be compared using a sliding window to the indexed data set, and a measure of distance is computed at each position of the sliding window. This approach enables indexing of time series data as it is received (e.g., indexing is ongoing and the dataset is preprocessed before a search is performed) and at scale (e.g., pattern matching can be performed across many large data sets simultaneously).

[0027] The current subject matter can perform pattern searches over very large data sets quickly. This can be achieved by indexing the dataset at ingestion (e.g., receipt and storage) of the data set. By pre-indexing, significant gains in query speed can be achieved.

[0028] In some implementations, the window size can be determined dynamically or predetermined. The window size can be determined based on the particular application (e.g., the underlying time series data). For example, the window size can be determined based on a length of time of an event that is being searched for. In addition, the sliding window can have a varying stride length (e.g., the number of samples that the window moves between distance computations), which can impact both detection rates and query speed performance.
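The effect of window size and stride on the amount of work can be illustrated with a short sketch. The function names, the 30-second event, and the 2 Hz sample rate are hypothetical values chosen for illustration.

```python
def window_length(event_seconds, sample_rate_hz):
    """Number of samples spanned by an event of the given duration."""
    return max(1, round(event_seconds * sample_rate_hz))


def window_starts(series_length, window, stride):
    """Start positions visited by a sliding window with the given stride."""
    return list(range(0, series_length - window + 1, stride))


window = window_length(event_seconds=30, sample_rate_hz=2)
dense = window_starts(series_length=1000, window=window, stride=1)
sparse = window_starts(series_length=1000, window=window, stride=10)
print(window, len(dense), len(sparse))
```

A larger stride reduces the number of distance computations roughly in proportion, at the cost of possibly skipping the position where a match aligns best.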

[0029] The current subject matter can be advantageous in that the indexing can also serve as a compression (e.g., encoding) scheme. By changing the representation of the time series data to a string format, the indexed data set can be notably smaller in size than the original time series data. The level of compression and fidelity of the compression can be varied by changing the number of levels (e.g., quantization levels such as number of string characters used) used to represent the time series data in string format.
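The trade-off between number of levels and size can be made concrete with a rough estimate. This sketch assumes 64-bit source samples and fixed-width bit packing of symbols, neither of which is specified in the text; `samples_per_symbol` is a hypothetical parameter standing in for any aggregation of several samples into one symbol.

```python
import math


def bits_per_symbol(levels):
    """Bits needed to store one symbol with fixed-width packing (assumed)."""
    return max(1, math.ceil(math.log2(levels)))


def compression_ratio(levels, source_bits=64, samples_per_symbol=1):
    """Approximate size ratio of raw samples to packed symbols."""
    return source_bits * samples_per_symbol / bits_per_symbol(levels)


for levels in (3, 6, 16):
    print(levels, bits_per_symbol(levels), round(compression_ratio(levels), 1))
```

Fewer levels give a higher ratio and a coarser approximation; more levels preserve more amplitude detail at a lower ratio.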

[0030] In some implementations, data can be scaled for unit variance. This can be performed because given time series data may have dynamic amplitude ranges and events may occur with different amplitudes. By scaling (e.g., normalizing) the time series data and query signal when creating the string representation, pattern detection can be improved. For example, FIG. 7 shows two plots 700 and 750 illustrating normalized reference data.
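Scaling for unit variance can be sketched as a z-score transform. Subtracting the mean is an assumption made here (the text mentions only unit variance), and the sample values are hypothetical.

```python
import statistics


def normalize(values):
    """Scale samples to zero mean and unit variance (z-score)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant signal: nothing to scale
    return [(v - mean) / std for v in values]


# The same event shape at two different amplitudes.
small = normalize([1.0, 2.0, 3.0, 2.0, 1.0])
large = normalize([10.0, 20.0, 30.0, 20.0, 10.0])
print(small)
print(large)
```

After normalization the two sequences coincide up to floating-point rounding, so they would quantize to the same string and match each other regardless of their original amplitude.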

[0031] FIG. 8 is a plot 800 illustrating an example L2 norm distance for an example query. The string representation transform results in flat level approximations of the data. As a result, the distance computation may not specifically identify the start of a pattern match (e.g., event), and additional processing can be performed to detect the index corresponding to the start of the pattern. This can be achieved by detecting sequences of distances that have the same value and treating each such sequence as a single match. For example, if the distance is computed as “3, 2, 3, 1, 1, 1, 2, 3”, then the series of “1, 1, 1” can be treated as a single match (e.g., because it is flat) instead of three matches. Combining can be performed based on the window size to report the match location as a single match instead of a multiplicity of values.

[0032] In some implementations, flat level approximations of the data can be processed by identifying the minimum distance values, determining a first-order difference of the index locations of all the indexes returned, and, where the difference between index locations equals 1, combining those locations.
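The combining step of paragraphs [0031] and [0032] can be sketched as follows, using the example distance sequence given above. The function name and the (start, end) return format are assumptions made for illustration.

```python
def merge_flat_matches(distances):
    """Collapse runs of adjacent minimum-distance positions into single matches.

    Returns a list of (start_index, end_index) tuples. Assumes a non-empty input.
    """
    minimum = min(distances)
    positions = [i for i, d in enumerate(distances) if d == minimum]
    runs = [[positions[0], positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev == 1:       # first-order difference of 1: same match
            runs[-1][1] = cur
        else:                     # gap: a separate match begins
            runs.append([cur, cur])
    return [tuple(run) for run in runs]


print(merge_flat_matches([3, 2, 3, 1, 1, 1, 2, 3]))
```

The first element of each tuple can be reported as the start of the match, so the flat run “1, 1, 1” yields one match rather than three.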

[0033] Although a few variations have been described in detail above, other modifications or additions are possible. For example, in some implementations, the search techniques can be parallelized to operate at scale. The current subject matter can be multivariable enabled and can include a framework to process data (including indexing and searching) quickly (e.g., in near real-time). Indexing can be performed independently for segments of the time series data and in parallel. Searching for a query in a time series index can be performed in parallel by splitting the time series and parallelizing the processing.
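Splitting a search across workers can be sketched as below. This is one possible arrangement, not the described system: all names are hypothetical, and a thread pool from the Python standard library is used only to keep the sketch self-contained (a process pool or a cluster framework could be substituted for CPU-bound workloads). The work is split by window start position, so windows that straddle a split point are still evaluated.

```python
from concurrent.futures import ThreadPoolExecutor


def char_distance(a, b):
    """Integer separation between two lowercase symbols."""
    return abs(ord(a) - ord(b))


def search_chunk(index, query, start, stop):
    """Squared distances for window start positions in [start, stop)."""
    n = len(query)
    return [(pos, sum(char_distance(q, c) ** 2
                      for q, c in zip(query, index[pos:pos + n])))
            for pos in range(start, stop)]


def parallel_search(index, query, workers=4):
    """Split the window start positions into ranges and search them concurrently."""
    last = len(index) - len(query) + 1       # number of valid window starts
    if last <= 0:
        return []
    step = -(-last // workers)               # ceiling division
    bounds = [(s, min(s + step, last)) for s in range(0, last, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: search_chunk(index, query, *b), bounds))
    return [item for part in parts for item in part]


results = parallel_search("aabcdcbaabcf", "bcd", workers=3)
print(min(results, key=lambda r: r[1]))
```

Because each worker reads up to `len(query) - 1` symbols past the end of its own range, the combined result is identical to a single serial pass.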

[0034] The subject matter described herein can provide many technical advantages. For example, some implementations of the current subject matter can normalize and index datasets at the time of ingestion; assemble the query string and time series data based on the physical reality of the data (window size, stride length, etc.); perform a univariate distance search in the string domain; compute the multi-variate distance using an L2 norm; search for the minimum distance along the time series; highlight locations that satisfy the minimum-distance threshold; and the like.

[0035] Further, some implementations of the current subject matter can enable compression of time series during ingestion by indexing and searching through indexed data rapidly; an ability to determine a quantitative metric for distance between query and time series in a multivariate domain; accelerated computing at scale using parallel computing; and the like. The current subject matter can enable searching through massive time series data sets with a low computational burden and at scale (e.g., across a large range of data sets). Because the current subject matter does not rely on some transformations, such as the fast Fourier transform or wavelet representation, the current subject matter can be less prone to noise and it can scale better to larger data sets. Further, some implementations of the current subject matter can include normalizing the data set space and determining the string approximation for a preset number of levels; creating an index and storing the index as a function of time into the same dataset; and assembling the query and rolling it (e.g., sliding it) through the indexed dataset. Some implementations can also include a multi-variate approach where a distance metric can be determined as the query sweeps through the dataset, in contrast to univariate based approaches.

[0036] One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0037] These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural language, an object-oriented programming language, a functional programming language, a logical programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example as would a processor cache or other random access memory associated with one or more physical processor cores.

[0038] To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Other possible input devices include touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive trackpads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.

[0039] In the descriptions above and in the claims, phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features. The term “and/or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features. For example, the phrases “at least one of A and B;” “one or more of A and B;” and “A and/or B” are each intended to mean “A alone, B alone, or A and B together.” A similar interpretation is also intended for lists including three or more items. For example, the phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and/or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.” In addition, use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.

[0040] The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and sub-combinations of the disclosed features and/or combinations and sub-combinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:

1. A method comprising:

receiving a string representation of a query signal of a data set storing a string representation of a stream of time series data, the time series data generated by a machine asset;

searching an indexed data set for an occurrence of the query signal by at least determining a distance between the query signal and portions of the indexed data set; and

providing the occurrence of the query signal within the indexed data set.