WO2019008394A1 - Digital information capture and retrieval - Google Patents

Info

Publication number
WO2019008394A1
WO2019008394A1 (PCT/GB2018/051935)
Authority
WO
WIPO (PCT)
Prior art keywords
information
user
task
entity
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/GB2018/051935
Other languages
French (fr)
Inventor
Marc Sloan
Andrew O'HARNEY
Matteus TANHA
Alberto CETOLI
Stefano BRAGAGLIA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cscout Ltd
Original Assignee
Cscout Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB1710995.0A external-priority patent/GB201710995D0/en
Priority claimed from GBGB1710993.5A external-priority patent/GB201710993D0/en
Priority claimed from GBGB1710997.6A external-priority patent/GB201710997D0/en
Application filed by Cscout Ltd filed Critical Cscout Ltd
Publication of WO2019008394A1 publication Critical patent/WO2019008394A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/105Human resources

Definitions

  • the present invention relates to the capture and retrieval of digital information, for example on the internet.
  • the invention provides a digital autonomous system and method for online knowledge work capture, management, and assistance, for example, to a user of a web browser.
  • Google uses a knowledge base it calls the Knowledge Graph to provide structured and detailed information about a searched topic.
  • the "enhanced" information is gathered from a wide variety of sources, and may include a list of links to other potentially related websites. For example, entering the name of a company into the Google web search engine typically results in a summary of information relevant to that company, which, in addition to providing a simple overview (e.g. a brief description) of the company, might include "enhanced” information relating to key personnel, subsidiaries, contact details, current stock price, links to related websites, and so on.
  • the "enhanced" information output in response to a search query is, however, generic in the sense that it is not tailored to a specific task being performed by the user of the web browser, but rather is confined to the search session without consideration of the overall task.
  • the present invention provides an improved system and method for the capture and retrieval of information relevant to a task being performed by a user of the web browser.
  • a method in a data processing system comprising a processor and a memory, for automated retrieval of stored digital information during a user-performed task, comprising: receiving, by the data processing system, digital information accessed by a user during a current task; classifying, by the processor, the current task based on at least one of: current digital information and previous digital information received in relation to the user; comparing, by the processor, the current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the user; determining, by the processor, whether any of the identified stored previous tasks contain entities and/or relations corresponding to the current task; and upon positive determination, by the processor, providing the user with digital information extracted from the one or more identified stored previous tasks.
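As an illustrative sketch only (all names are hypothetical, not taken from the patent), the claimed steps of classifying the current task, comparing it against stored previous tasks of the same classification, checking for matching entities, and returning the extracted information could be arranged as:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    label: str                                 # task classification, e.g. "company_research"
    entities: set = field(default_factory=set)
    notes: dict = field(default_factory=dict)  # entity -> stored digital information

def retrieve_for_current_task(current: Task, previous_tasks: list) -> dict:
    """Return information extracted from previous tasks that share the
    current task's classification and at least one entity."""
    results = {}
    for prev in previous_tasks:
        if prev.label != current.label:              # compare classifications
            continue
        overlap = current.entities & prev.entities   # matching entities/relations
        for entity in overlap:
            if entity in prev.notes:                 # positive determination
                results.setdefault(entity, []).append(prev.notes[entity])
    return results

prev = Task("company_research", {"Acme Ltd"}, {"Acme Ltd": "HQ: London"})
cur = Task("company_research", {"Acme Ltd", "Jane Smith"})
matches = retrieve_for_current_task(cur, [prev])   # {'Acme Ltd': ['HQ: London']}
```

A production system would replace the exact-match comparisons with the statistical similarity measures described below.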
  • a method for automated retrieval of stored digital information during a user-performed task comprising: receiving digital information accessed by a user during a task; determining at least one entity based on said received digital information; receiving further digital information accessed by a user during a task; determining a property of said entity based on said further digital information; collating digital information associated with said entity; and providing said collated digital information to the user.
  • a computer-implemented method of processing information during a user-performed task comprising: extracting information from at least one information source accessed by a user during a task; identifying at least one of an entity and a property associated with an entity from said extracted information; associating the identified at least one of an entity and a property associated with an entity with a stored database of entities and properties thereby to update the database; in response to a user query related to a particular entity, extracting information relevant to the particular entity from the database; and providing said information relevant to the particular entity to the user.
  • a task comprises an information gathering task for a particular purpose, using at least one information source.
  • the at least one information source is accessed by the user via a network connection, such as via the Internet.
  • information is extracted automatically from the at least one accessed information source.
  • said property comprises further information about said determined entity.
  • said further information about said determined entity comprises: a location, contact details, a skill, a role, a sector, an investment, or a document.
  • said property comprises an entity related to said determined entity; optionally said related entity comprises: a company, a person, a social media profile, a product, or a project.
  • the method may comprise weighting the related entities and/or properties according to the relevance and/or confidence associated with said entities and/or properties.
  • the digital information relates to at least one webpage.
  • the information being retrieved relating to the webpage may be HTML content; optionally the information further comprises the webpage URL; optionally, the information further comprises actions taken by the user while viewing the webpage.
  • a sequence of accessed websites may be mapped to a vector, for example thereby creating task/workflow embedding vectors.
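One crude way to map a sequence of accessed websites to a vector, sketched here purely for illustration (the patent does not prescribe this scheme), is a count vector over a fixed vocabulary of domains; learned task/workflow embeddings would replace this in practice:

```python
from urllib.parse import urlparse

def sequence_to_vector(urls, vocabulary):
    """Map a sequence of visited URLs to a fixed-length count vector
    over a known vocabulary of domains (a crude workflow embedding)."""
    counts = [0] * len(vocabulary)
    index = {domain: i for i, domain in enumerate(vocabulary)}
    for url in urls:
        domain = urlparse(url).netloc   # extract the domain part of the URL
        if domain in index:
            counts[index[domain]] += 1
    return counts

vocab = ["linkedin.com", "github.com", "news.example.com"]
vec = sequence_to_vector(
    ["https://linkedin.com/in/jane", "https://github.com/jane"], vocab)
# vec == [1, 1, 0]
```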
  • the method may further comprise comparing a current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the current task and/or user.
  • comparing the current task against previous tasks may comprise identifying a primary entity corresponding to the current task, and searching for said primary entity in the stored previous tasks.
  • comparing the current task against previous tasks may comprise measuring the statistical similarity of the current task and one or more previous tasks, optionally using a trained classifier.
  • the information relating to the webpage may be retrieved by a (for example, plug-in) extension to the web browser and sent to the data processing system.
  • each user may be identified by an anonymous encrypted key.
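A common way to derive such an anonymous key, shown here only as a sketch (the patent does not specify the scheme), is a keyed hash of the user's identity so that the raw identity never needs to be stored:

```python
import hashlib
import hmac

def anonymous_user_key(user_id: str, secret: bytes) -> str:
    """Derive a stable, anonymous identifier for a user with a keyed
    hash; the original identity is not recoverable from the key."""
    return hmac.new(secret, user_id.encode(), hashlib.sha256).hexdigest()

key = anonymous_user_key("jane@example.com", b"server-side-secret")
# the same input always yields the same key, so records can be linked
# to a user without storing who that user is
```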
  • the method may further comprise converting received digital information into a predetermined ontology.
  • the digital information is received as one or more first class objects.
  • receiving digital information comprises identifying entities and/or relations in the information.
  • identifying entities and/or relations in the information comprises comparing the information against a predetermined mapping.
  • identifying entities and/or relations in the information comprises using Named Entity Recognizers.
  • the method may further comprise allocating at least one of a score and a weighting to said entity identified in the information based on a confidence rating that said entity is accurately identified.
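A toy sketch of entity identification with confidence scores follows; it uses a hand-built gazetteer and a capitalization heuristic purely for illustration, where a real system would use trained Named Entity Recognizers as described above:

```python
import re

# Toy gazetteer mapping known entities to (type, confidence); in
# practice these would come from a trained recognizer.
KNOWN_ENTITIES = {"Acme Ltd": ("company", 0.9), "Jane Smith": ("person", 0.8)}

def recognize_entities(text):
    """Return (entity, type, confidence) triples found in the text.
    Unknown capitalized two-word phrases get a low fallback confidence."""
    found = []
    for name, (etype, conf) in KNOWN_ENTITIES.items():
        if name in text:
            found.append((name, etype, conf))
    for match in re.finditer(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b", text):
        candidate = match.group(1)
        if candidate not in KNOWN_ENTITIES:       # low-confidence fallback
            found.append((candidate, "unknown", 0.3))
    return found

ents = recognize_entities("Jane Smith joined Acme Ltd last May.")
```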
  • said further digital information accessed by a user during a task comprises information accessed by a user during the same task as said received digital information.
  • the method further comprises determining an entity representative of said task.
  • determining an entity representative of said task comprises determining an entity highly connected to other entities, weighting entities on their relevance to the task, and/or receiving an indication from the user.
  • said further digital information accessed by a user during a task comprises information accessed by a user during a previous task.
  • the further digital information is received in relation to the user.
  • the further digital information is received in relation to other users.
  • said previous task may be selected from a number of previous tasks in dependence on the relevance of the previous task to the current task.
  • the relevance of said previous task is determined on the basis of a primary entity of said current task being present in the previous task.
  • the relevance of said previous task is determined on the basis of connections between primary entities of said tasks.
  • the relevance of said previous task is determined on the basis of a measure of the similarity of workflows.
  • the relevance of said previous task is determined on the basis of a comparison of the websites visited during each task.
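One simple instance of comparing the websites visited during each task, given here as an assumed example rather than the patent's actual measure, is the Jaccard similarity of the two sets of visited sites:

```python
def task_similarity(sites_a, sites_b):
    """Jaccard similarity of the sets of websites visited during two
    tasks: shared sites divided by all distinct sites."""
    a, b = set(sites_a), set(sites_b)
    if not (a | b):          # both tasks empty: define similarity as zero
        return 0.0
    return len(a & b) / len(a | b)

score = task_similarity(
    ["linkedin.com", "github.com"],
    ["linkedin.com", "crunchbase.com"])
# 1 shared site out of 3 distinct sites -> 1/3
```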
  • the method further comprises classifying said task based on said received digital information and/or said received further digital information.
  • the method further comprises predicting, by the processor, user-desired information based on at least one of: current digital information and previous digital information; querying an external data source for external information relevant to the predicted user-desired information; and upon positive determination of external information relevant to the predicted user-desired information, receiving said external information into the memory.
  • the method further comprises determining that a task is underway.
  • the method further comprises associating the identified at least one of an entity and a property associated with an entity with the current task.
  • the method further comprises associating at least one information source with a particular task.
  • the database is a graph database.
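A graph database stores entities as nodes and properties/relations as labelled edges. A minimal in-memory stand-in (hypothetical names; a real deployment would use an actual graph database) might look like:

```python
class KnowledgeGraph:
    """Minimal in-memory stand-in for a graph database: entities are
    nodes, properties and relations are labelled edges."""

    def __init__(self):
        self.edges = {}   # entity -> list of (relation, target)

    def add(self, entity, relation, target):
        self.edges.setdefault(entity, []).append((relation, target))

    def properties_of(self, entity):
        """Return all (relation, target) pairs stored for an entity."""
        return self.edges.get(entity, [])

g = KnowledgeGraph()
g.add("Acme Ltd", "located_in", "London")
g.add("Acme Ltd", "employs", "Jane Smith")
```

Because every fact is an edge, queries such as "all properties of Acme Ltd" reduce to simple graph lookups, which is what makes the generic query mechanism described above possible.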
  • providing said information relevant to the particular entity to the user comprises using a user interface.
  • a user interface configured to: retrieve digital information from a webpage being accessed by a user performing a task on a web browser; transmit said retrieved digital information to a data processing system; receive stored digital information from the data processing system; and output said received stored digital information to said user; wherein the digital information output to the user is continually updated during performance of the task based on the webpages accessed by the user.
  • the user interface outputs the digital information within the web browser.
  • the output is in the form of a user interface element that is configured to display different digital information according to the type of information, for example web-links, email addresses, free text.
  • the user interface comprises a web browser extension arranged to communicate digital information with the web browser.
  • the means for providing said information relevant to the particular entity to the user comprises a user interface as described herein.
  • a system for automated retrieval of stored digital information during a user-performed task comprising: means for receiving digital information accessed by a user during a current task; means for classifying, by the processor, the current task based on at least one of: current digital information and previous digital information received in relation to the user; means for comparing the current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the user; means for determining whether any of the identified stored previous tasks contain entities and/or relations corresponding to the current task; and means for providing the user, upon positive determination, with digital information extracted from the one or more identified stored previous tasks.
  • a system for automated retrieval of stored digital information during a user-performed task comprising: means for receiving digital information accessed by a user during a task; means for determining at least one entity based on said received digital information; means for receiving further digital information accessed by a user during a task; means for determining a property of said entity based on said further digital information; means for collating digital information associated with said entity; and means for providing said collated digital information to the user.
  • the system comprises a computing device in communication with a data processor, wherein the computing device is configured to capture digital information accessed by a user during a current task and to send the captured information to the data processor.
  • the digital information accessed by the user is on a webpage, preferably wherein said digital information comprises at least one of (HTML) content and the URL.
  • the computing device is configured to allow the user to access the webpage via a web browser.
  • the web browser comprises a (for example, plug-in) extension that is configured to capture the digital information on the web page.
  • a method for classifying user-performed tasks comprising: receiving a sequence of user accessed websites corresponding to a user performed task; mapping said sequence of user accessed websites to a classification vector; and classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
  • a computer-implemented method of classifying a sequence of user accessed websites in accordance with user-performed tasks comprising: receiving a sequence of user accessed websites; and using a trained classifier, classifying sub-sequences of user accessed websites as particular user-performed tasks. In such a way, a user's task can be automatically classified which can lead to automatically providing the user with information relevant to the task.
  • classifying said task comprises using a trained classifier.
  • the trained classifier comprises a recurrent neural network.
  • the method further comprises training the classifier using a labelled sequence of website vectors as an input, thereby to build a(n internal) representation of the task.
  • the method further comprises projecting said representation of the task onto said classification vector.
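To make the recurrent-classifier idea concrete, here is a deliberately tiny sketch with a single hidden unit and fixed, untrained weights (all values are illustrative assumptions): the network consumes a sequence of website features, builds an internal representation in its hidden state, and projects that state onto a classification vector via softmax:

```python
import math

def rnn_classify(sequence, W_in, W_rec, W_out):
    """Run a one-unit recurrent network over a sequence of scalar
    website features, then project the final hidden state onto a
    classification vector via softmax."""
    h = 0.0
    for x in sequence:
        h = math.tanh(W_in * x + W_rec * h)   # recurrent update of hidden state
    logits = [w * h for w in W_out]           # projection layer
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]          # probability per task class

probs = rnn_classify([1.0, 0.5, 0.2], W_in=0.8, W_rec=0.5, W_out=[1.0, -1.0])
```

A trained system would use learned weight matrices over whole website vectors rather than scalars, but the shape of the computation is the same.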
  • the method further comprises training the classifier to classify a sequence as belonging to a predefined community, and to measure how well a sequence belongs to a classification.
  • said sequence of user accessed websites is represented as a vector.
  • the method further comprises splitting the sequence into at least one sub-sequence of accessed websites.
  • said at least one sub-sequence is mapped to a classification vector. For accuracy, the sub-sequences of website vectors may be iteratively broken or joined to reach an optimal classification quality.
  • the method further comprises determining a community of websites, said community comprising one or more webpages relating to a particular category of information.
  • the confidence level associated with said classification vector comprises a measure of prediction accuracy; optionally said measure of prediction accuracy comprises the perplexity of said classification vector.
  • said classification vector comprises a list of probabilities of the sequence belonging to a specific class.
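The perplexity mentioned above can be computed directly from that list of class probabilities; the following sketch shows the standard definition (low perplexity means a confident prediction, with the maximum equal to the number of classes for a uniform distribution):

```python
import math

def perplexity(probabilities):
    """Perplexity of a probability distribution: 2 raised to its
    Shannon entropy. Lower values indicate a more confident classifier."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

confident = perplexity([0.97, 0.01, 0.01, 0.01])   # close to 1
uncertain = perplexity([0.25, 0.25, 0.25, 0.25])   # 4.0, maximal for 4 classes
```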
  • the method further comprises determining the start and/or end of said task.
  • a system for classifying user-performed tasks comprising: means for receiving a sequence of user accessed websites corresponding to a user performed task; means for mapping said sequence of user accessed websites to a classification vector; and means for classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
  • a method for predictive searching of databases comprising: receiving digital information accessed by a user during a task into a memory; determining relevant data related to said task not stored within said memory; retrieving said relevant data from an external data source; and presenting said relevant data to the user; wherein said relevant data is determined in dependence on said data related to said task stored within said memory.
  • a computer-implemented method of predictive searching of at least one information source comprising: extracting information from at least one information source accessed by a user during a task into a database; using the information in the database, identifying further information that is likely to be of relevance to the task, wherein the further information is not included in the information in the database; extracting the further information from at least one information source into the database; and presenting said further information to the user.
  • relevant data can be presented to a user without them having to proactively search for it.
  • the method may further comprise determining a classification of the task based on said received digital information; wherein said relevant data related to said task not stored within said memory is determined in dependence on said determined task classification.
  • determining a classification of the task based on said received digital information comprises: receiving a sequence of user accessed websites corresponding to a user performed task; mapping said sequence of user accessed websites to a classification vector; and classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
  • said relevant data related to said task not stored within said memory is determined in dependence on one or more identified entities in the contents of the memory.
  • the method further comprises predicting a future task in dependence on said task, and wherein relevant data related to said task not stored within said memory is determined in dependence on said predicted future task.
  • the memory may comprise digital information related to previous tasks and/or tasks performed by other users.
  • the digital information related to previous tasks and/or tasks performed by other users is used to determine relevant data related to said task not stored within said memory.
  • the digital information accessed by a user during a task comprises information relating to an entity. So that the data is relevant to a primary entity, the method may further comprise identifying a primary entity in the digital information accessed by a user during a task in said memory, wherein the relevant data related to said task not stored within said memory relates to the primary entity.
  • retrieving said relevant data from an external data source may comprise querying for data related to the primary entity.
  • retrieving said relevant data from an external data source may comprise scraping a website.
  • retrieving said relevant data from an external data source may comprise querying an external application program interface (API).
  • the method may further comprise mapping said relevant data to the input of said API.
  • presenting said relevant data to the user may comprise compiling the relevant data retrieved from the external data source with digital information accessed by a user during a task on said memory.
  • presenting said relevant data to the user may comprise linking said data to data already in said memory.
  • said data may be linked to data relating to previous tasks.
  • a system for predictive searching of databases comprising: means for receiving digital information accessed by a user during a task into a memory; means for determining relevant data related to said task not stored within said memory, said relevant data being determined in dependence on said data related to said task stored within said memory; means for retrieving said relevant data from an external data source; and means for presenting said relevant data to the user.
  • the invention also provides a computer program or a computer program product for carrying out any of the methods described herein, and/or for embodying any of the apparatus features described herein, and a computer readable medium having stored thereon a program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
  • the invention also provides a signal embodying a computer program or a computer program product for carrying out any of the methods described herein, and/or for embodying any of the apparatus features described herein, a method of transmitting such a signal, and a computer product having an operating system which supports a computer program for carrying out the methods described herein and/or for embodying any of the apparatus features described herein.
  • known semantic web technologies include, for example, OntoText and Cambridge Semantics.
  • the present invention requires no natural language query, and retrieves useful information, preferably in the form of one or more first class information entities, rather than URLs. Information is augmented directly into the current webpage being accessed by the user.
  • the present invention may be considered to function as a task completion assistant for professionals conducting online research (e.g. knowledge workers such as recruiters, salespeople, investors, academics, analysts, etc.). It saves the researcher time, and provides structure to their research task while promoting best practice and allows organisations / individuals to make the most of the work being done every day by hundreds of millions of workers. In this way, a user may be provided with related information featured on other websites that assists them in their present (e.g. research) task without having to navigate to other websites whereby to collate the information themselves.
  • Task-oriented products such as Cortana, Google Knowledge Box, for example, do not focus on professional task assistance, nor do they capture and/or represent current work being done across websites/applications.
  • Known technologies provide no ability to customise a search to fulfil a task, nor can the searches be queried against or shared.
  • Sector specific information aggregators such as Entelo, DueDil and FullContact, for example, have no understanding of overall workflow and do not dynamically present information based on information needs.
  • Task capture software such as ATLAS Recall takes periodic screenshots of a user's computer screen, which may include the user's web browsing content, and uses optical character recognition to store and index this content using conventional search technology. This content is not collated into tasks, and no meaning is ascribed to what the user was doing with that content. Further, the only mechanism of retrieving the content is to perform a natural language search.
  • the present invention assists users with web-based tasks by keeping track of the work they are currently doing, which information is stored in an easily retrievable format. It uses techniques from natural language processing, knowledge representation and reasoning, deep and reinforcement learning, and dynamic/task based information retrieval to break the sequence of pages a user looks at down to the task level, predict intent, and deliver relevant information around this context based on a personalised knowledge graph.
  • Tasks preferably connotes a discrete activity performed by a user (or multiple users) for a particular purpose.
  • Tasks may include tasks performed by a user online, such as using social profiles to assess a candidate for recruitment, investigating a company for "know your customer” (KYC) purposes, or reading academic research papers, for example.
  • a task may include at least one of the following actions: using a web browser to perform (e.g. Google) searches, reading a webpage, clicking a link, copying and pasting content, or completing a form. These actions may occur in a sequence referred to herein as a "workflow".
  • the term "computing device” preferably connotes an electronic device having data input/output capabilities, a processor arranged to run software and a digital display, preferably configured to display said output in graphical form.
  • digital information preferably connotes information that can be managed and retrieved by a computing device, which information is stored (usually electronically) using a series of ones and zeros.
  • the term “mechanism” preferably connotes elements of the present invention that perform various operations, functions, and related aspects.
  • the term 'first class object' or 'first class entity' preferably connotes an object or entity which supports all mathematical / processing operations.
  • the term 'vector' preferably connotes a one dimensional array.
  • a vector is preferably a type of first class object.
  • the term 'ontology' preferably connotes a description of a domain, where the ontology is made up of a collection of concepts / classes / entities and the properties / relations between such concepts / classes / entities.
  • the term 'knowledge base' preferably connotes an ontology augmented with a set of rules that allow patterns in the information provided in the knowledge base to be found.
  • the term 'knowledge graph' refers to a knowledge base having data organized as a graph and/or implemented using a graph database.
  • the terms 'knowledge graph' and 'knowledge base' may be understood to be interchangeable.
  • web browser extensions such as may be used with Google Chrome (RTM), for example
  • the web browser extension described herein serves two main purposes: workflow collection and task card display.
  • a task card provides a user with information about their current workflow and task within the web browser extension, as well as optionally previous workflows and/or information predicted to be relevant to the user's future tasks.
  • Any apparatus feature as described herein may be provided as a method feature, and vice versa.
  • means plus function features may be expressed alternatively in terms of their corresponding structure.
  • Figure 1 shows an exemplary system according to the present invention
  • Figure 2 shows a system overview in more detail
  • Figures 3A and 3B show the system architecture
  • Figure 3C shows a flow diagram showing the steps of a computer-implemented method of processing information during a user-performed task for use with the system
  • Figure 3D shows a flow diagram showing the steps of a computer-implemented method of classifying a sequence of user accessed websites in accordance with user-performed tasks for use with the system;
  • Figure 3E shows a flow diagram showing the steps of a computer-implemented method of predictive searching of at least one information source for use with the system
  • Figure 4 illustrates the knowledge / workflow representation aspect in more detail
  • Figure 5 shows an example of domain taxonomy
  • Figure 6 shows the taxonomy of Figure 5 with relationships shown
  • Figure 7 shows a map of classified websites created for classifying a user workflow
  • Figure 8 shows a neural network classifying a user workflow
  • Figure 9 illustrates the vectorising of a document
  • Figure 10 shows an exemplary task card
  • Figure 11 shows a graphical knowledge base that provides information for the task card
  • Figure 12 shows an example of a Question Answering mechanism
  • Figure 13 shows a history of previous tasks
  • Figures 14A and 14B show an example of how relevant information can be captured and presented to a user during a task
  • Figures 15A and 15B show an example of a knowledge graph showing previous stored information relating to the task of Figure 14A, and the presentation of stored information;
  • Figures 16A and 16B show how the stored information illustrated in Figure 15A may be retrieved for automated population of text fields;
  • Figure 17 shows another example of how relevant information can be captured and presented to a user during a task
  • Figure 18 shows an example of the stored information being presented to a user via a mobile computing device
  • Figure 19 shows a schematic representation of a present graph
  • Figure 20 shows a schematic representation of a past graph
  • Figure 21 shows a pipeline for constructing the past entity search
  • Figure 22 shows a schematic representation of a future graph
  • Figure 23 shows a schematic representation of a super graph
  • Figure 24 shows a schematic representation of a data wrapper for importing data from third parties into the system
  • Figure 25 shows the flows of information through the system
  • Figure 26 shows the schematic operation of a task manager of the system and associated components
  • Figure 27 shows the architecture of an integration description language (IDL).
  • Figure 28 shows a graph of pages visited by a user
  • Figure 29 shows a vector transformation of the graph of Figure 28
  • Figure 30 shows the architecture of a neural net for predicting the user related to the graph of Figure 28.
  • Figure 31 shows schematic hardware components configured to implement the described system.
  • Figure 1 presents an exemplary system 100 according to the present invention in which a user is accessing the internet via a web browser running on a computing device.
• a web browser extension running on the web browser monitors the webpages accessed by the user, retrieves certain information from the accessed webpages and communicates information relating to the accessed webpages between the web browser and a separate data processing system comprising a processor and a memory.
  • Knowledge workers typically use the internet (or "web") to complete professional tasks, which may involve performing multiple searches, manual information consolidation, and translational effort in moving/sharing completed work between formats, applications, and people. This is inefficient and relies on the time, skill, and memory of the worker.
  • the present system takes, as input, current information (e.g. HTML content, URL) from a webpage being accessed by a user and, optionally, user actions (e.g. mouse hovers, clicks, drags, etc.).
  • the present task being performed by the user is then classified in the context of both the current information and previous information received relating to that user (e.g. company research, candidate research, technical question answering, etc.).
• the "work" and "knowledge" may be represented in graphical format where tasks, entities, and relationships accessed and viewed by the user are represented as "first class objects" that can be queried and logically reasoned over.
  • the representation of work and knowledge in graphical format facilitates a generic query mechanism.
  • the process involves receiving a request from the user and automatically formulating a query against the (knowledge) graph using a translation/query layer.
  • any text is described using a vector.
  • a workflow or task is also described as a vector.
  • the vector represents the 'sentiment' of the text/graph/task (i.e. generalized information related to the text/graph/task that the vector is describing, such as the subject of the text/graph/task, thereby to allow classification to take place)
  • the vector may be "a vector/information about recruitment", or "a vector/information about cinema tickets”.
  • machine learning is used to associate these vectors. This may allow a vector representing a description of a task to be found to be similar to a vector of a graph that describes the information in that task.
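As a rough sketch of how such vectors might be compared, cosine similarity can score how closely a task-description vector matches the vector of a graph. The three-dimensional toy vectors below are purely illustrative, not values from the system.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: a task description and two candidate knowledge graphs.
task_vec = [0.9, 0.1, 0.0]           # a "vector about recruitment"
graph_recruitment = [0.8, 0.2, 0.1]  # graph built during a recruitment task
graph_cinema = [0.0, 0.1, 0.9]       # graph about cinema tickets

# The graph whose vector is most similar to the task description wins.
best = max([graph_recruitment, graph_cinema],
           key=lambda g: cosine_similarity(task_vec, g))
```

In this toy setup the recruitment graph is correctly associated with the recruitment task description, since its vector points in nearly the same direction.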
• a web browser extension 1 may be used to capture digital information (e.g. the content, URL, etc.) from an information source such as a web service (such as a website) 2 accessed by a user performing a task. Captured digital information is sent to the knowledge extractor 3 and to the workflow extractor 4.
  • the knowledge extractor 3 returns all the entities and relations found within the content of the webpage 2 just captured.
  • the workflow extractor 4 creates or updates user workflows (i.e. a sequence of actions forming a task). Extracted information is stored in graphical form (i.e. a knowledge graph or knowledge base) in a data store (not shown).
  • Extracted knowledge (in particular knowledge that relates to the current workflow) is presented to the user via an output 5, such as a task card which is part of a user-interface.
  • the browser extension may push the current task card towards the user (e.g. by displaying the task card as a 'pop-up' on the user interface), for example to notify the user that updated information is available to view.
  • the user may pose questions in (pseudo) natural language about the task card 5 in a query field 6 of the task card 5.
  • a translational parser converts the question into a graph query against the current subgraph.
• the answer presented to the user consists of any matching data in the data store and is appended to the current task card 5.
  • the user may review the task card of previous workflows 7, and may provide search criteria to filter the list of previous task cards.
  • the information presented to the user is then updated accordingly.
• the information retrieval mechanism (using the knowledge extractor 3 and workflow extractor 4) comprises three main aspects:
i. Information capture 200: the retrieval of information from a webpage being accessed by a user, for example the (e.g. HTML) content and URL, preferably together with the actions (e.g. mouse clicks, hovers, etc.) taken by the user when browsing the webpage;
ii. Work and knowledge representation 300: the task is represented in terms of the knowledge found during performance of the task, which representation may utilise knowledge graphing and/or neural graph embedding of workflows; and
iii. Work assistance 400: the work representation can be queried in a generic way, which may allow applications to be built that can assist the user, for example.
  • Figures 3A and 3B show two complementary illustrations of the system 100, where Figure 3A shows a component view and Figure 3B illustrates an implementation.
• the knowledge/work representation 300 may be implemented as a knowledge base graph, as in Figure 3A, where the links between each user are illustrated.
  • the question answering and fact recommendation components of the work assistance 400 may be implemented using an application sidebar on a web page, as in Figure 3B.
  • Work assistance 400 may also or alternatively take the form of an export to a further service, such as Google (RTM) Sheets, a web based customer relationship management (CRM) software, or custom CRM software.
  • the web services 2 used may include web pages, Github (RTM), Lusha (RTM), custom databases, or other third party software.
  • Figure 3C shows a flow diagram showing the steps of a computer-implemented method 10 of processing information during a user-performed task for use with the system 100, which makes use of the aspects mentioned above.
  • information is extracted from at least one information source accessed by a user during a task (i.e. the information capture 200 aspect is used).
• in a second step, relevant information (in particular, at least one of an entity and a property associated with an entity) is identified within the extracted information.
  • the identified relevant information is associated with a stored database of entities and properties (i.e. the knowledge graph) thereby to update the database. It will be appreciated that the second and third steps together make use of the work and knowledge representation 300 aspect.
• in a fourth step 18, in response to a user query related to a particular entity, information relevant to the particular entity is extracted from the database.
• in a fifth step 19, said information relevant to the particular entity is provided to the user.
  • Figure 3D shows a flow diagram showing the steps of a computer-implemented method 20 of classifying a sequence of user accessed websites in accordance with user-performed tasks for use with the system 100.
  • Classifying sequences of websites (or other information sources) accessed by the user may be useful in distinguishing discrete tasks from each other.
• in a first step 22, a sequence of user accessed websites is received.
• in a second step, sub-sequences of user accessed websites are classified as particular user-performed tasks using a trained classifier.
  • Figure 3E shows a flow diagram showing the steps of a computer-implemented method 30 of predictive searching of at least one information source for use with the system 100. Predictively searching for information of relevance may improve the utility of a database (for a particular task) by incorporating relevant information without a user's specific input.
• in a first step 32, information from at least one information source accessed by a user during a task is extracted into a database.
• in a second step 34, further information that is likely to be of relevance to the task is identified using the information in the database. This further information is not already included in the database.
  • the further information is extracted from at least one information source into the database.
  • the further information is presented to the user.
  • structured information may be extracted from semi-structured and unstructured information, thus enabling the system to detect relevant information in various different websites in accordance with a predefined ontology (as described in more detail further on in relation to Figures 5 and 6).
  • a mapping mechanism may be provided for extracting certain information from the HTML structure of certain webpages, by creating a template for a given webpage in which all the different parts of the webpage are tagged such that when a user accesses that webpage the relevant information in those tagged parts can easily be identified and extracted by the mapping mechanism.
  • a web browser extension may be created that allows different HTML objects in a given webpage to be tagged to represent different entity types, and, optionally, also to associate certain relations between the tagged entities. Once a webpage is tagged, the web browser extension may be used as the mapping mechanism that collects the entities and relations from any webpage having the same structure.
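A minimal sketch of such a template-driven mapping mechanism, using Python's standard `html.parser`: the template below is a hypothetical mapping from HTML class attributes to entity types, standing in for the tags a user would create with the browser extension.

```python
from html.parser import HTMLParser

class TemplateExtractor(HTMLParser):
    # Collects text from HTML elements whose 'class' attribute was
    # tagged in the template as a known entity type.
    def __init__(self, template):
        super().__init__()
        self.template = template
        self.entities = []
        self._current_type = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.template:
            self._current_type = self.template[cls]

    def handle_data(self, data):
        if self._current_type and data.strip():
            self.entities.append((self._current_type, data.strip()))
            self._current_type = None

# Hypothetical template for one profile-page layout.
template = {"name": "PERSON", "employer": "COMPANY"}
page = ('<div><span class="name">James Smith</span> works at '
        '<span class="employer">ACME</span></div>')

extractor = TemplateExtractor(template)
extractor.feed(page)
```

Any page sharing the same HTML structure can then be processed with the same template, which is the reuse property the text describes.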
• the mapping mechanism can, however, only be utilized as long as there are webpages which have the exact same HTML template; when a webpage is accessed for which a template does not exist (i.e. a webpage which has not been mapped), then only free text may be extracted from that webpage.
• to extract entities from free text, Named Entity Recognizers (NERs) may be used. Such NERs are models which find entities based on textual context and patterns in free text. For example, a relation may be extracted if two or more entities are found in the same sentence, subject to a few other restrictions, such as that the distance between the entities cannot exceed a certain limit.
  • the extracted entities and relations are all allocated a confidence ranking.
  • an additional step is used to verify the entity. This step could be, for instance, checking that an entity of type "Person" is present in a database of person names. If an entity is verified then there is an increase in the confidence of that entity.
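The sentence-distance rule for relations and the confidence boost for verified entities might be sketched as follows. The entity tuples, the KNOWN_PERSONS set and the 0.2 boost are illustrative assumptions, not values from the text.

```python
KNOWN_PERSONS = {"James Smith"}  # hypothetical verification database

def verify(entity):
    # Boost confidence when the entity is found in a reference database.
    text, etype, position, confidence = entity
    if etype == "PERSON" and text in KNOWN_PERSONS:
        confidence = min(1.0, confidence + 0.2)  # illustrative boost
    return (text, etype, position, confidence)

def extract_relations(entities, max_distance=10):
    # Propose a relation between entities found in the same sentence
    # whose token positions are within a distance limit.
    relations = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if abs(a[2] - b[2]) <= max_distance:
                relations.append((a[0], "RELATED_TO", b[0]))
    return relations

# Entities as (text, type, token_position, confidence), e.g. from an NER.
entities = [verify(("James Smith", "PERSON", 0, 0.7)),
            verify(("ACME", "COMPANY", 4, 0.6))]
relations = extract_relations(entities)
```

Here "James Smith" is verified and gains confidence, while "ACME" keeps its original score; the two entities are close enough to propose a relation.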
  • the entities and relations are sent to the database along with their entity types and confidences. ii. Work and knowledge representation (300)
  • Figure 4 shows the knowledge/workflow representation 300 aspect of Figure 3 in more detail, in particular showing the general conceptual model used to represent users, workflows/tasks and the information in those workflows as a graph.
• An exemplary summary shows the concepts (i.e. the nodes in the graph) and the relationships between them (i.e. the directed edges in the graph).
• Each concept is independent from other concepts (except for MANAGER, which is also a USER).
  • Each node is accompanied by a curly bracket in which are listed common properties that may be tracked for that concept. Where two nodes are connected by an edge, a relationship exists between those concepts.
  • Edges have names (and sometimes also properties); for instance: workflows have an ident, an initial timestamp and a final timestamp; workflows belong to users (which have an ident, an anon_key, a username and a text), relate to a category (which has a text) and include pages (which have an event_ident, a timestamp, a page_ident, a URL, a domain and a title).
• Figure 5 shows the many specific types of entities and relationships of Figure 4 organised as a domain taxonomy (or ontology), which in particular shows the entity hierarchy. All of the relations are IS_CHILD_OF and represent specializations of entity types, i.e. an ADVISOR is a specialization of a PERSON, which is a specialization of an INDIVIDUAL. Structured extraction finds entities and relations of specified type on webpages in specific domains (e.g. parts of the website), and the mapping mechanism that allows such extraction is organised into a taxonomy of entity types. In such a data structure, a child entity is a more specific concept than the parent entity (IS_CHILD_OF relation); it retains all the features of the parent type and possibly adds more.
• All the entity types inherit the properties "text", "confidence", "surface form" and "source" from the base ENTITY type.
• a child entity may inherit from different parent entities and, generally, sibling entities are not disjoint (e.g. a PERSON might be both an ADVISOR and a DIRECTOR), but disjointness can be made explicit by means of the DISJOINT_WITH relation (e.g. an INDIVIDUAL is either a PERSON or a COMPANY).
  • the entity types shown in Figure 5 are from the recruitment domain, but it will be appreciated that many other domains (having different or adapted entities) may alternatively or additionally be used.
  • Figure 6 shows the taxonomy of Figure 5, showing the conceptual relations that can exist between entities.
• an ADVISOR has an ADVISES relation to a COMPANY entity.
• Relations extracted via structured extraction can be applied on the same taxonomy graph to make explicit the relations between entities. If a relationship exists between two entities, all the child entities of the tail entity (the entity from where the relationship goes out) potentially retain the same relationship toward any child entity of the head entity (the entity where the relationship comes in) unless otherwise specified (i.e. an INDIVIDUAL might INVESTS_IN a COMPANY, but also a COMPANY might INVESTS_IN an ACADEMY because COMPANY is a subclass of INDIVIDUAL and ACADEMY a subclass of COMPANY).
• Unstructured extraction might find more relationships by parsing the FREE TEXT found in the pages in the given domain; if so, these relationships are added to the network of relationships in the previous figure. It should also be noted that unstructured extraction will eventually build a universal taxonomy of concepts and relations that will bridge over domains. Domains can coexist, too, so, depending on the needs of a user, more taxonomies might be merged into a higher taxonomy to address these needs.
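The inheritance behaviour described above can be sketched with a small IS_CHILD_OF table. The helper function names are hypothetical, but the entity types and the INVESTS_IN example follow the text.

```python
# Minimal taxonomy sketch: IS_CHILD_OF links, with relations defined
# between parent types applying to their specializations.
IS_CHILD_OF = {
    "PERSON": "INDIVIDUAL",
    "COMPANY": "INDIVIDUAL",
    "ADVISOR": "PERSON",
    "ACADEMY": "COMPANY",
}
RELATIONS = {("INDIVIDUAL", "INVESTS_IN", "COMPANY")}

def ancestors(entity_type):
    # Return the entity type and all its parents up to the root.
    chain = [entity_type]
    while chain[-1] in IS_CHILD_OF:
        chain.append(IS_CHILD_OF[chain[-1]])
    return chain

def relation_holds(tail, name, head):
    # A relation between parent types potentially holds between
    # any of their child types, unless otherwise specified.
    return any((t, name, h) in RELATIONS
               for t in ancestors(tail) for h in ancestors(head))
```

With this table, `relation_holds("COMPANY", "INVESTS_IN", "ACADEMY")` is true because COMPANY is a subclass of INDIVIDUAL and ACADEMY a subclass of COMPANY, matching the example in the text.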
  • Figure 7 represents a map of websites, in which each accessed website is mapped to a vector (thereby creating document embedding vectors), whereby to create the map of websites for use in determining a user's workflow.
  • a particular task 11 is represented as a path on the map of websites, showing a user's progress between websites.
• the points indicated on the path (labelled 1, 2, 3, 4) represent a sequence of visited websites.
• a Recurrent Neural Network (RNN) is trained to classify sequences into predetermined communities and to determine how well a sequence belongs to a particular classification (for example by evaluating a confidence level).
  • the sequence of all visited websites is thereby divided into sub-sequences that remain within a community boundary.
  • the communities found may be used as labels for the tasks, as indicated in the 'key' for Figure 7, for example.
  • the RNN receives an input of labelled sequence of websites (i.e. vectors representing the websites themselves, as well as the order in which the user accesses the websites) to build an internal representation of the vectors.
  • the final hidden state of the RNN is projected onto a "workflow embeddings layer" (i.e. a characterization of a workflow in terms of websites visited).
  • This last layer is then projected onto the classification vector thereby to classify the websites into e.g. classes A, B, C, D...
  • the classification vector is a list of probabilities stating the confidence that the sequence belongs to a specific class.
  • the RNN thereby learns a representation for the entire sequence (the workflow embedding, which can be compared for similarity with other workflow embeddings by using the dot product), and a classification for a particular sequence (according to the predefined communities).
  • Websites (documents and text) are transformed from HTML into vector representations, which are passed to the RNN and used to classify the websites related to the vector representations.
  • the incoming sequences are provided continuously (i.e. a continuous input of websites visited is fed into the RNN).
  • the RNN breaks sequences into subsequences and/or joins subsequences as appropriate in order to improve classification quality.
  • An output of the RNN is a classification vector for a subsequence and/or a particular website.
  • the perplexity of the classification vector can be taken as a measure of the classification quality.
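Taking the perplexity of the classification vector as a quality measure might look like the following sketch, where perplexity is the exponential of the distribution's entropy; low perplexity indicates a peaked, confident classification, while a uniform vector over N classes has perplexity N.

```python
import math

def perplexity(probs):
    # Perplexity of a probability distribution: exp of its entropy.
    # Low perplexity = a confident (peaked) classification vector.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)

confident = [0.97, 0.01, 0.01, 0.01]  # sequence clearly belongs to class A
uncertain = [0.25, 0.25, 0.25, 0.25]  # no clear task classification
```

The uniform vector yields perplexity 4 (as uncertain as a fair four-way guess), so a threshold on perplexity could decide whether a sub-sequence has been classified well enough.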
  • the detection of a start and/or end of a workflow may trigger further actions in the system. For example, when a workflow is ended and/or a different workflow started, the task card/knowledge base for the recently ended workflow may be completed and processed for later retrieval. iii. Work assistance (400)
  • Figure 10 shows one possible output of the system for work assistance, which is in the form of a "task card" user-interface providing information retrieved from the data processing system that relates to a current task being performed on a web browser by the user.
  • the task card is part of the web browser extension, which retrieves information from webpages accessed by the user and transmits that information to the data processing system.
  • the information presented in the example shown represents a Company Research task performed in respect of 'University of Town'
  • the task card provides information on persons of interest related to University of Town, information on their expertise, location information, linked organizations, and skills.
• the task card in Figure 10 provides the following information:
• What the task was about (University of Town in this case)
  • Sources of data from which information items are acquired e.g. logos of websites for which a mapping mechanism exists (not shown in Figure 10).
  • the user is also able to add notes to the task card manually or highlight text on any page and add it to the task card.
• A screenshot of an instance of the taxonomy populated with the data from a workflow (i.e. Company Research) about University of Town is shown in Figure 11.
  • This interface allows a user to view a representation of the relevant knowledge graph, and therefore may be referred to as a 'graph viewer'.
  • This knowledge graph shows the complexity of numerous entities and relationships identified and captured while performing the workflow. Every information entity extracted from each webpage is shown, together with all of the relations between each entity with the page and each other. This is a simplified visualization of the knowledge base graph structure. Entity types (e.g. person, company, location, etc.) can be filtered in this view. Also, every entity has a 'score' that represents the confidence that it is accurate, its connectedness in the workflow and how prevalent it is across all workflows. Entities can also be filtered by this score. The complexity of workflows is clearly demonstrated by the graph viewer.
  • the knowledge base can be interrogated by means of an open interface in pseudo-natural language that assists the user in building the questions that they want to ask to the system, as shown in Figure 12.
  • the information retrieval mechanism is triggered (1) when a certain keyword is typed into a search query input box (e.g. "what", "which", etc.) (2).
  • the mechanism reads all the entity types (e.g. "artefact", "company” and “individual") from the taxonomy and populates a (preferably, pop-up) menu of options (3) from which the user can select the main concept of the question.
  • the mechanism retrieves all the relations and entity types that can be reached from the current entity and organises them in a menu (4) from where the user can select how to continue the question.
  • the previous target entity (“invests in academy”) then becomes the current entity (6). This step can be executed zero or more times.
  • the menu also contains a special item "with text... " (5) after which the user can specify the text associated with the entity instance of interest (e.g. "University of Town”) (6).
• the question built so far is passed as input to a parser that utilises a bespoke translational grammar built automatically from the taxonomy to convert it into a database (DB) query. If the parser detects a semantic error, it prompts possible corrections to the user and waits for further input; otherwise, the mechanism formulates answers as output, as described further on.
• the menu is also optimised by looking at the data currently available in the search space; if a specific relation/individual couple is unavailable from the current position, its item is removed from the menu.
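One way such a translational layer could turn a menu-built question into a graph query is sketched below. The Cypher-like output syntax and the function name are assumptions for illustration; the example question ("which company invests in academy with text 'University of Town'") follows the text.

```python
def build_graph_query(entity_type, steps, text=None):
    # Translate a menu-built question into a Cypher-like graph query.
    # steps is a list of (relation, target_entity_type) selections.
    match = f"(e0:{entity_type})"
    for i, (relation, target) in enumerate(steps):
        match += f"-[:{relation}]->(e{i + 1}:{target})"
    # The optional "with text ..." menu item becomes a filter on the
    # final entity in the chain.
    where = f' WHERE e{len(steps)}.text = "{text}"' if text else ""
    return f"MATCH {match}{where} RETURN e0"

query = build_graph_query("COMPANY", [("INVESTS_IN", "ACADEMY")],
                          text="University of Town")
```

Because the relation chain can be extended zero or more times, the same builder covers both a bare entity question and an arbitrarily long chain of relations.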
  • the concepts in which the entities gathered by a user are organised are also used to index the instances in those classes for retrieval purposes.
• the search might be restricted to the user's current unit of work, the user's past work, or all the past work done by the user's team.
  • the data that match the search criteria is sorted by team (if applicable), unit of work (if applicable) and by centrality/TF-IDF to be ranked.
• the most relevant data (if any) is added to the output (e.g. task-card). If the graph does not yet contain an answer, a placeholder is added to the output, which will be automatically replaced by the answer when it becomes available.
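The TF-IDF part of this ranking might be sketched as follows, scoring an entity term highly when it is frequent in the current unit of work but rare across stored workflows. The toy workflow term lists are illustrative.

```python
import math

def tf_idf(term, doc_terms, all_docs):
    # Term frequency in the current unit of work, discounted by how
    # common the term is across all stored workflows.
    tf = doc_terms.count(term) / len(doc_terms)
    df = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

# Toy term lists standing in for three stored workflows.
workflows = [["acme", "smith", "acme"], ["acme", "cinema"], ["smith", "cv"]]
current = workflows[0]

ranked = sorted(set(current),
                key=lambda t: tf_idf(t, current, workflows),
                reverse=True)
```

In a full implementation this score would be combined with a graph-centrality measure, as the text indicates, rather than used alone.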
  • Figure 13 shows a dashboard user-interface comprising a collection of task cards (representing different or overlapping knowledge graphs) a user has accumulated while using the system.
  • a timeline shows the websites and searches made during the workflow. Users may organise their task cards into folders / projects / task types.
  • a search interface allows a user to find a specific task card using key words.
  • the dashboard may also be used to access knowledge graphs corresponding to task cards. All created knowledge graphs are stored into a database for retrieval at a later date.
  • Figures 14A and 14B illustrate how information may be captured during a user-performed task, and that information used to enhance the task.
  • the user has accessed a particular webpage that lists certain information about the person who is the subject of that particular workflow (e.g. a research task for potential recruitment).
• the system (e.g. the web browser extension) captures this information.
• Figure 15A shows a knowledge graph comprising the captured information on the various entities and relations relevant to the subject person in Figures 14A and 14B. It can be seen that there are three main entities, about which the other entities are interconnected.
  • the task card in Figure 15B provides a convenient user-interface for a user to be presented with information, which may be previous information stored from a previous task that is retrieved if it is relevant to a current task being performed by the user.
• Figures 16A and 16B show how the stored information that is captured during a user-performed task may conveniently be used to autofill text fields. In this example, the system is being employed in a recruitment application.
  • the system has identified information from the webpage accessed by the user, where the information relates to the required fields, and has retrieved the required information and presented it in the task card.
  • an autofill function on the task card has been used to complete the required fields in the form using the information presented in the task card.
  • Figure 17 shows another example of the system capturing information from a webpage accessed by the user, this time the information being a telephone number captured from an email and sent to the server, in addition to being presented in the (updated) task card.
  • Figure 18 shows an example of a mobile computing device on which a user has accessed the task card of Figure 17.
  • the mobile computing device is also a mobile telephone device, which is now presenting the user with the option to reach the subject person by calling the telephone number previously captured from the email (in Figure 17).
  • Other possible applications (i.e. the work assistance 400 aspect) of the system include (but are not limited to) the following:
  • Task/Work management to manipulate and organise the work being captured (e.g. being notified of the previous work performed by a user, or a colleague for example, when starting a similar task or event)
  • Fact recommendation to assist with task completion e.g. auto-filling the information found into a task into a form, email, content management system/database, report, etc.
  • Action recommendation to suggest actions based on the current work (e.g. send an email to a person being researched)
  • the system may be arranged to communicate with a third party service thereby to receive input data and/or provide output data.
  • Another aspect of the invention is the provision of a related-entity search / predictive search engine.
  • a knowledge graph is built by the system, consisting of entities and relations from at least one web page. This knowledge graph represents a summary of the "present" task and information need, and so may be referred to as a 'present graph'.
  • Figure 19 shows a schematic representation of a present graph 170, where the nodes represent entities.
  • a primary entity 171 (or entities) can be determined. This is an entity representative of the whole task and given relative importance amongst the other entities.
  • the primary entity can be determined by one or more of:
• the user's and/or the user's team's previous workflows may be accessed. For example, with reference to the example shown in Figures 10-18, a user may have done some research on James Smith, and then a week later they do some research on his employer, ACME. While researching ACME, it would be helpful for the user to be reminded of what they previously learned about James Smith and also show how it links to the current information about ACME.
  • Figure 20 shows a schematic representation of a 'past graph' 180 made up of knowledge graphs from several workflows having a common entity 171.
  • the determined primary entity is used to find workflow graphs in the user's workflow history.
  • the user's workflow history is queried to determine whether the primary entity is present. If the primary entity is present in any of the past workflows, the graphs for those workflows are retrieved. These graphs are then aggregated with the present graph to form a wider graph, which may be referred to as the 'past graph'.
  • the past graph contains all of the entities that have previously been found to have some association with the primary entity.
  • other methods may be used for determining related workflows other than using a primary entity, for example, a measure of the similarity of workflows may be used, or the websites contained in the workflows may be compared.
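Aggregating the present graph with past workflow graphs that contain the primary entity might be sketched as below, representing each graph as a set of (tail, relation, head) triples. The example entities follow the James Smith / ACME example; the function name is an assumption.

```python
def build_past_graph(present_graph, history, primary_entity):
    # Aggregate the present graph with every past workflow graph
    # that contains the primary entity.
    past_graph = set(present_graph)
    for workflow_graph in history:
        entities = ({t for t, _, _ in workflow_graph}
                    | {h for _, _, h in workflow_graph})
        if primary_entity in entities:
            past_graph |= workflow_graph
    return past_graph

present = {("ACME", "EMPLOYS", "James Smith")}
history = [{("James Smith", "STUDIED_AT", "University of Town")},
           {("Cinema Co", "SELLS", "Tickets")}]
past = build_past_graph(present, history, "James Smith")
```

Only the workflow mentioning James Smith is merged in; the unrelated cinema workflow is left out, so the past graph contains exactly the entities previously associated with the primary entity.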
  • Figure 21 shows a pipeline for constructing the past entity search using the components of the system described with reference to Figure 2.
• the pipeline consists of the following steps:
i. Extract entities from the current webpage 2 using the knowledge extractor 3 and place them into the current workflow (or present graph).
ii. If there is a primary entity 171 (such as a Google search query), extract it and extract any entities from it. In the example shown in Figure 21, the primary entity is 'John Smith'.
iii. Using the primary entity, search the user's knowledge graph of previous workflows and find a list of all workflows the user has previously created containing the primary entity. Combine this information into a past graph.
iv.
  • weights provide entities that are relevant to the current task, related to the primary entity, have a relatively high degree of confidence, and are contextually relevant to the user.
  • the entities are ranked using this combined weight and then the highest ranking entities are returned to the user, for example via the browser extension 5.
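A simple illustration of such combined weighting: each entity's degree (connectedness), adjacency to the primary entity, and extraction confidence are combined into a score used for ranking. The specific formula and example values are assumptions for illustration only.

```python
def rank_entities(graph, primary, confidences, top_k=3):
    # Neighbours of the primary entity receive a bonus weight.
    neighbours = ({h for t, _, h in graph if t == primary}
                  | {t for t, _, h in graph if h == primary})
    scores = {}
    for tail, _, head in graph:
        for entity in (tail, head):
            if entity == primary:
                continue
            degree = sum(1 for a, _, b in graph if entity in (a, b))
            bonus = 1.0 if entity in neighbours else 0.0
            scores[entity] = (degree + bonus) * confidences.get(entity, 0.5)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

graph = {("James Smith", "WORKS_AT", "ACME"),
         ("ACME", "LOCATED_IN", "Town"),
         ("Town", "HAS", "Cinema")}
confidences = {"ACME": 0.9, "Town": 0.8, "Cinema": 0.3}
top = rank_entities(graph, "James Smith", confidences)
```

The well-connected, high-confidence neighbour (ACME) outranks the tangentially related, low-confidence entity (Cinema), matching the ranking behaviour described.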
  • the browser extension may display a message indicating that a user's colleague has performed research about the primary entity or about another primary entity within the last week.
  • the past graph may be used to provide USER-based or TEAM-based recommendations.
• Given the current WORKFLOW, the system identifies the CATEGORY and the ENTITIES that are closest to the primary ENTITY of the WORKFLOW. The system finds other WORKFLOWS (from the user, or from TEAM members) that PERTAIN to the same CATEGORY and/or are about any of the close ENTITIES identified above. This information is ranked by the number of connections to the initial WORKFLOW. The information, ordered by relevance, may be displayed on a side panel to give the user a selection of data with which to complete the current task or to suggest new actions.
  • Figure 22 shows a schematic representation of a 'future graph' 200 made up of entities found from 3rd party data sources such as the web, an API or a database.
  • data sources may be 'scraped' (i.e. data is extracted from human-readable output) to acquire the entities for use in a future graph, as will be described later on.
• the primary entity may be used as a way to find entities for the future graph; other mechanisms may be used depending on the data source and the information in the current graph.
• Data is acquired from third party sources using a service layer (which may also be used to provide context-relevant information for work assistance, as previously described) configured to access such sources.
  • Example data sources include GitHub (RTM) and Google (RTM) Docs.
  • the future graph thereby acts as a kind of predictive search engine, requiring no active user input.
• This 'predictive searching' capability is accomplished by initially understanding what the user's current task is, and hypothesising what the next task will be. For instance, if the user's current search query has already been satisfied, the next query can be estimated. The entities in the current and previous workflows can be used to make an accurate guess, and the current task type can be used to guess the next step. For example, if the user is performing a recruitment task then it is likely that the user will want information from LinkedIn (RTM) next. As the system knows what information the user has found so far and how it connects to the user's history of tasks, it can see which pieces of information are the most important in this task (using the past search functionality described above) and use that as a basis to search over LinkedIn for more information.
  • a pre-emptive search can be performed.
• At least three types of data source will be used: a. API: Many web services have an API. They usually take some input and return data from that service. The most relevant entities from a user's workflow are extracted, matched to the inputs of the service, and data is retrieved from the service.
  • Figure 23 shows a schematic representation of a 'super graph' 210.
  • the previously described graphs are combined to form a super graph, which is made up of all of the information related to the current task (and primary entity).
  • the information in the super graph comes from the current tasks, tasks that the user has done in the past, and information they may wish to find in the future. In theory, this graph should contain all of the information that the user will need given the context of their current task.
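Representing each graph as a set of triples, forming the super graph is then just a union of the present, past and future graphs. The example triples are illustrative, following the James Smith example used earlier.

```python
# Each graph is a set of (tail, relation, head) triples.
present_graph = {("James Smith", "WORKS_AT", "ACME")}
past_graph = {("James Smith", "STUDIED_AT", "University of Town")}
future_graph = {("James Smith", "HAS_PROFILE", "linkedin.com/in/jsmith")}

# The super graph combines everything known or predicted to be
# relevant to the current task and primary entity.
super_graph = present_graph | past_graph | future_graph
```

Using sets means duplicate facts found in more than one of the three graphs collapse into a single triple in the super graph.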
  • the super graph is constructed so that we can perform entity search.
  • a super graph may contain thousands of entities, many of which are only tangentially related to the task. To resolve this, the entities are weighted and ranked so that we can determine which of the entities are contextually the most related to the current task (and/or primary entity) and use them accordingly.
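As a rough illustration of the weighting and ranking step, the sketch below scores entities by their graph distance from the primary entity, so that nearer entities rank as more contextually relevant. The graph shape and the decay rule are assumptions.

```python
from collections import deque

# Illustrative sketch of weighting entities in a 'super graph' by their
# distance from the primary entity; nearer entities are assumed more
# contextually relevant. The decay rule 1/(1+d) is an assumption.

def rank_entities(graph, primary):
    """graph: {node: [neighbours]}; returns (entity, weight) pairs, best first."""
    dist = {primary: 0}
    queue = deque([primary])
    while queue:                      # breadth-first search from the primary entity
        node = queue.popleft()
        for nb in graph.get(node, []):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    # Weight decays with distance; the primary entity itself is excluded.
    weights = {n: 1.0 / (1 + d) for n, d in dist.items() if n != primary}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

graph = {"John Smith": ["Acme Ltd", "CV.pdf"], "Acme Ltd": ["London"]}
ranking = rank_entities(graph, "John Smith")
```

Tangentially related entities (here, "London") fall to the bottom of the ranking, matching the pruning described above.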
  • the information may be presented in response to the user's activities, without a specific user input.
  • the information may also be accessed in response to a specific user request, for example via a question in natural language format (as described earlier).
  • the information is generally presented as part of a task card (as previously described) or other user interface element.
  • Figure 24 shows a schematic representation of a data wrapper for importing data from third parties into the system.
  • the wrapper comprises an input script to allow the system to query a third party API based on identified relevant entities in the knowledge base.
  • the third party API may then export data to the system via an output script of the wrapper, thereby adding new entities into a knowledge base.
  • the wrapper is thereby a translation layer from the context of a user (their task and history) to a service a third party can provide.
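The wrapper's translation role might be sketched as follows, with a stubbed third-party API; all class, field, and method names here are hypothetical.

```python
# Minimal sketch of the data wrapper of Figure 24: an input script maps
# relevant knowledge-base entities onto third-party API parameters, and an
# output script maps the API's response back into new entities.
# The API shape and field names are assumptions for illustration.

class DataWrapper:
    def __init__(self, api):
        self.api = api  # any object exposing a .search(**params) method

    def input_script(self, entities):
        """Translate knowledge-base entities into API query parameters."""
        return {"q": " ".join(e["name"] for e in entities)}

    def output_script(self, response):
        """Translate API records back into knowledge-base entity dicts."""
        return [{"name": r["title"], "source": "third_party"} for r in response]

    def run(self, entities):
        return self.output_script(self.api.search(**self.input_script(entities)))

class FakeApi:
    """Stand-in for a real third-party service."""
    def search(self, q):
        return [{"title": f"Result for {q}"}]

new_entities = DataWrapper(FakeApi()).run([{"name": "John Smith"}])
```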
  • the taxonomy can guide the automatic collection of statistics, including (but not limited to) the following:
  • Figure 25 shows information flows into and out of the system 100.
  • the core of the system is shown as a "service system" 261.
  • the service system receives user actions and a stream of events and documents, and provides an assistive response.
  • the system 100 supplies document text to an information extraction component 262 and receives corresponding extracted information.
  • Such information, together with detected user events, may be supplied to a task manager 263, which identifies tasks (as previously described) and creates "task slots", as will be described.
  • the service system 261 is also configured automatically to receive input from third parties 266 via the internet 265 and a service integration component 264 using its "predictive searching" capabilities, as previously described.
  • the system 100 described herein is accordingly capable of operating as a Service Arbitrage layer between the user and third party services they already use or can use. In this way, the system automatically manages queries to relevant 3rd party services that it estimates will help resolve the user's intent.
  • the system 100 is arranged to examine both the user's document set and actions taken to determine which, if any, tasks can be understood as such. If the combination of documents and actions being undertaken can be understood by the system as a task then it can begin servicing the user's intent. It does so by translating that task information into requests that can be understood by third parties.
  • the system 100 provides an assistive response to the user that can be, but is not limited to, a combination of information the system has organised/inferred, information from third parties, and actions that can be taken on third parties through the system.
  • Task Manager 263 At the core of the service system 261 is the Task Manager 263. It is the component of the system that decides task boundaries using all information and signals available. This includes the semantic information extracted by the information extraction component 262 from documents/websites and both the implicit and explicit actions the user takes when interacting with that information. If this combination of information and interactions can be understood by the system, then the task manager 263 creates a task slot. Task slots represent defined tasks that the system can service requests for (e.g. a recruitment task about John Smith). Once it has been determined that a serviceable task is underway, the system is able to act on both internal data and that from third party data providers 266. In order to connect to a third party, a service integration component 264 is used, which connects to the third party service providers 266 via the internet 265.
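A minimal sketch of the Task Manager's slot-creation decision is given below. The recognition rule (a known task type plus a primary entity plus recent user actions) is an assumption for illustration only.

```python
# Hypothetical sketch of the Task Manager: it combines extracted semantic
# information with user actions and, when the combination is understood,
# opens a 'task slot' (e.g. a recruitment task about John Smith).

class TaskSlot:
    """A defined task that the system can service requests for."""
    def __init__(self, task_type, primary_entity):
        self.task_type = task_type
        self.primary_entity = primary_entity

class TaskManager:
    KNOWN_TYPES = {"recruitment", "research"}  # illustrative set

    def observe(self, extracted_info, user_actions):
        task_type = extracted_info.get("task_type")
        primary = extracted_info.get("primary_entity")
        # Only open a slot when the task is understood and the user is active.
        if task_type in self.KNOWN_TYPES and primary and user_actions:
            return TaskSlot(task_type, primary)
        return None

slot = TaskManager().observe(
    {"task_type": "recruitment", "primary_entity": "John Smith"},
    ["visited profile page"],
)
```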
  • a Service Data Transformation is created. This implies the ability to transform data between the internal representation of a task and the format required by the third party. It also implies a degree of Service Discovery: the capability to determine which of the services a given third party exposes can be used for the task at hand (e.g. the GitHub service is queried if the user requires programming information).
  • a web-socket channel is created and stored in a first database 253.
  • a tracking event job is created and stored on a second database 254.
  • Each job is queued for extraction processing on the task queues 255.
  • the job is transferred to the extraction consumer pool 256.
  • Extraction workers use monolithic extraction services (such as NLTK, Spacy, and Gregory) 257.
  • Extractor services return entities to the extraction consumer pool 256.
  • Extractor worker stores triples in a third (graph) database 258.
  • Each job is queued for task extraction.
  • Task Extractors pull each job from the task queues 255 and determine serviceability on the task extraction consumer pool 259.
  • Task Extractors transform task data and query third parties 260.
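The job flow in the steps above might be sketched, much simplified, as follows; the databases, worker pools, and extraction services are reduced to in-process stand-ins, and the capitalised-word 'extractor' is a stub.

```python
from queue import Queue

# Toy sketch of the extraction job flow: tracking events are queued, an
# extraction worker pulls each job, runs a (stub) entity extractor, and
# stores the resulting triples in a graph store stand-in.

task_queue = Queue()   # stand-in for the task queues 255
graph_db = []          # stand-in for the third (graph) database 258

def extract_entities(text):
    # Stub extractor: every capitalised word becomes an entity. A real
    # worker would call services such as NLTK or Spacy instead.
    return [w for w in text.split() if w[0].isupper()]

def extraction_worker():
    """Drain the queue, storing one triple per extracted entity."""
    while not task_queue.empty():
        job = task_queue.get()
        for entity in extract_entities(job["text"]):
            graph_db.append((job["user"], "mentions", entity))

task_queue.put({"user": "u1", "text": "meeting with John at Acme"})
extraction_worker()
```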
  • IDL (integration description language)
  • the IDL comprises: an IntegrationClient 271; a ServiceSubscription 272; a ServiceSubscriptionFactory 273; a DataService 274; a ServiceTransform 275; a DataRepository 276; a TokenBearerServiceSubscription 277; and an OauthServiceSubscription 278.
  • IntegrationClient 271 In order to arbitrate requests on a user's behalf, services often require brokers to identify/authenticate themselves during service requests. To do so, they transmit credentials to the system, which are then sent along when making requests. Such credentials, along with general configuration, are stored in this class.
  • ServiceSubscription 272 In order to arbitrate requests on a user's behalf, users are required to authenticate themselves through a given service. OAuth 278 and token 277 authentication are the two primary ways of doing so, and each results in a second set of credentials that the system can use during requests on behalf of the user. The concrete inheriting classes store these secondary credentials.
  • ServiceSubscriptionFactory 273 Mainly used to decouple authentication logic from service-specific logic and to generate an authenticated service for a specific task slot.
  • DataService 274 This is the implementing interface for all services. It contains all logic for querying a given service. The structure follows the commonplace CRUD (create, read, update, delete) framework, with the repository-specific methods being implemented in the DataRepository class 276. Because all services at least implement the read method, there is a common interface for querying all sources a user is registered to at runtime.
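A sketch of such a common read interface, under the assumption of a Python-style abstract base class, could look like this; the GithubService stub and its return value are illustrative.

```python
from abc import ABC, abstractmethod

# Sketch of the DataService interface described above: services follow a
# CRUD shape, and every service implements at least `read`, giving a
# common interface for querying whatever sources a user is registered to.
# Class and method bodies beyond those named in the text are assumptions.

class DataService(ABC):
    @abstractmethod
    def read(self, query):
        """Every service must at least support reading."""

    def create(self, item):
        # Optional in this sketch; not every source is writable.
        raise NotImplementedError

class GithubService(DataService):
    def read(self, query):
        return [f"repo matching {query}"]  # stubbed result

# All registered services can be queried through the same interface.
results = [svc.read("parser") for svc in [GithubService()]]
```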
  • ServiceTransform 275 This class provides a description of how data for a given service can be transformed to and from the graph. Implementing classes are able to use the ontology to define how the service format can be changed to triple format. This allows for storage on the graph, and thus the deduplication, disambiguation, and reasoning over data coming from third party services.
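The transformation to triple format might be sketched as below, where a hypothetical ontology mapping turns fields of a third-party record into (subject, predicate, object) triples.

```python
# Hedged sketch of a ServiceTransform: an ontology mapping describes how
# fields in a third-party record become (subject, predicate, object)
# triples so the data can be stored on the graph and deduplicated and
# disambiguated there. The mapping and field names are assumptions.

ONTOLOGY_MAP = {"employer": "worksFor", "city": "locatedIn"}

def to_triples(record):
    """Map a flat service record to graph triples via the ontology mapping."""
    subject = record["name"]
    return [
        (subject, ONTOLOGY_MAP[field], value)
        for field, value in record.items()
        if field in ONTOLOGY_MAP  # unmapped fields are dropped in this sketch
    ]

triples = to_triples({"name": "John Smith", "employer": "Acme Ltd", "city": "London"})
```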
  • Service Discovery The naming of implementing classes of the DataService 274/DataRepository 276 interface allows for services to declare what task slots they can fill. For instance, if a given repository, say Companies House, contains data about people, then an implementing class would be called CompaniesHousePeopleRepository. In this way a given service can be dynamically associated with a task slot.
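The naming-convention discovery could be sketched as follows; the regular expression and the set of recognised slot names are assumptions.

```python
import re

# Sketch of naming-convention service discovery: a class named, say,
# CompaniesHousePeopleRepository declares that the CompaniesHouse service
# can fill a 'People' task slot. The parsing rule is an assumption.

def discover(class_name):
    """Split <Provider><Slot>Repository into (provider, slot), or None."""
    m = re.fullmatch(r"(.+?)(People|Companies|Documents)Repository", class_name)
    return (m.group(1), m.group(2)) if m else None

provider, slot = discover("CompaniesHousePeopleRepository")
```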
  • definitions created in ServiceTransform 275 classes are strictly checked. This allows for programmatic definitions to be generated or inferred. This significantly reduces development time as suggestions and changes can be done automatically.
  • the system self-adapts over time.
  • the service discovery mechanism works for a set of pre-defined tasks; however, it is inflexible to creating new task slots over time. For this, a more nuanced mechanism of dynamic association may be used.
  • Fingerprinting
  • Figure 28 shows a graph of page visits for a user. Each time a user visits pages, their activity is recorded as a graph on the knowledge base 300. This graph of user activity is useful for characterising user behaviour.
  • Figure 29 shows a vector transformation ('fingerprint') of the graph of Figure 28.
  • the vectors within this transformation preserve the distance between graphs: similar graphs having vectors close to each other.
  • an additional dummy node 291 (with text "TOP") is added to the graph that is connected to all the other nodes.
  • the similarity between graphs is computed by making a neural net predict the user that generated a specific graph.
  • the architecture is shown in Figure 30.
  • the inputs to this neural net are the Doc2Vec embeddings and the adjacency matrix of the relevant graph, and the output is the logits of the user IDs.
  • the graph embeddings can be found in the last layer of the network.
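A much-simplified sketch of the fingerprinting idea is given below: a dummy 'TOP' node is added, and the graph is pooled into a fixed-length vector. Mean-pooling here is a stand-in for the Doc2Vec-plus-neural-net pipeline of Figures 29 and 30, not the actual architecture.

```python
# Simplified fingerprinting sketch: add the dummy 'TOP' node connected to
# every other node, then reduce the graph to one fixed-length vector by
# mean-pooling per-node embeddings. Both steps are illustrative stand-ins.

def add_top_node(graph):
    """Return a copy of the graph with a 'TOP' node linked to all nodes."""
    g = {node: list(nbs) for node, nbs in graph.items()}
    g["TOP"] = list(graph.keys())
    return g

def fingerprint(node_vectors):
    """Mean-pool node embeddings into one fixed-length graph vector."""
    dims = len(next(iter(node_vectors.values())))
    n = len(node_vectors)
    return [sum(v[d] for v in node_vectors.values()) / n for d in range(dims)]

g = add_top_node({"a": ["b"], "b": []})
vec = fingerprint({"a": [1.0, 0.0], "b": [0.0, 1.0], "TOP": [0.5, 0.5]})
```

Because similar activity graphs produce similar pooled vectors, distances between fingerprints loosely track distances between graphs, as described above.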
  • Figure 31 shows an example of a computer device suitable for implementing the system 100 (at least in part).
  • the computer device 1000 comprises a processor in the form of a CPU 1002, a communication interface 1004, a memory 1006, storage 1008, removable storage 1010 and a user interface 1012 coupled to one another by a bus 1014.
  • the user interface 1012 comprises a display 1016 and an input/output device, which in this embodiment is a keyboard 1018 and a mouse 1020. In other embodiments, the input/output device comprises a touchscreen.
  • the CPU 1002 executes instructions, including instructions stored in the memory 1006, the storage 1008 and/or removable storage 1010.
  • the communication interface 1004 is typically an Ethernet network adaptor coupling the bus 1014 to an Ethernet socket.
  • the Ethernet socket is coupled to a network.
  • the memory 1006 stores instructions and other information for use by the CPU 1002.
  • the memory 1006 is the main memory of the computer device 1000. It usually comprises both Random Access Memory (RAM) and Read Only Memory (ROM).
  • the storage 1008 provides mass storage for the computer device 1000. In different implementations, the storage 1008 is an integral storage device in the form of a hard disk device, a flash memory or some other similar solid state memory device, or an array of such devices.
  • the removable storage 1010 provides auxiliary storage for the computer device 1000.
  • the removable storage 1010 is a storage medium for a removable storage device, such as an optical disk, for example a Digital Versatile Disk (DVD), a portable flash drive or some other similar portable solid state memory device, or an array of such devices.
  • the removable storage 1010 is remote from the computer device 1000, and comprises a network storage device or a cloud-based storage device.
  • the system 100 is implemented as a computer program product, which is stored, at different stages, in any one of the memory 1006, storage device 1008, and removable storage 1010.
  • the storage of the computer program product is non-transitory, except when instructions included in the computer program product are being executed by the CPU 1002, in which case the instructions are sometimes stored temporarily in the CPU 1002 or memory 1006.
  • the removable storage 1010 is removable from the computer device 1000, such that the computer program product may be held separately from the computer device 1000 from time to time.
  • the computer program product may also or alternatively be distributed, such that only certain aspects of the computer program product are stored and/or implemented via the computer device.
  • the user may use the communication interface 1004 to access information sources using the internet, which may be incorporated into a database/graph held in storage.
  • the database/graph may be saved remotely, for example via a "cloud server", in which case the computer device is effectively used as a controller for the system.
  • a user telecommunication device such as a "smartphone" may be used.
  • the system may be arranged to dynamically present newly acquired relevant information to the user, as previously mentioned, and in addition to contextually provide current information in response to the user's current task/workflow.
  • the data fields that a user sees in a task card may dynamically change in response to the user's current task/workflow.
  • although the present invention (and in particular, the related entity search/predictive search engine aspects) has generally been described with reference to a research task, particularly in the field of recruitment, it will be appreciated that the invention may be applied to any field in which a user acquires information via the internet and/or one or more databases.
  • the system may be able to assist a user in baking a cake by capturing information related to various alternative recipes and ingredients, reminding the user about previously researched recipes, and predictively suggesting new recipes.
  • the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention.
  • any feature in a particular aspect described herein may be applied to another aspect, in any appropriate combination.
  • particular combinations of the various features described and defined in any aspects described herein can be implemented and/or supplied and/or used independently.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Finance (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Artificial Intelligence (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a computer-implemented method of processing information during a user-performed task, the method comprising: extracting information from at least one information source accessed by a user during a task; identifying at least one of an entity and a property associated with an entity from said extracted information; associating the identified at least one of an entity and a property associated with an entity with a stored database of entities and properties thereby to update the database; in response to a user query related to a particular entity, extracting information relevant to the particular entity from the database; and providing said information relevant to the particular entity to the user.

Description

Digital information capture and retrieval
The present invention relates to the capture and retrieval of digital information, for example on the internet. In particular, the invention provides a digital autonomous system and method for online knowledge work capture, management, and assistance, for example, to a user of a web browser.
Conventional web and enterprise search providers (e.g. Google, Bing, Algolia, Elastic & Swiftype) require textual natural language input from a user, and typically output lists of web URLs in reply. The technology currently employed focuses on document retrieval mechanisms that use web-crawlers to index data on the web to allow it to be retrieved for the user.
To enhance its search results, Google uses a knowledge base it calls the Knowledge Graph to provide structured and detailed information about a searched topic. The "enhanced" information is gathered from a wide variety of sources, and may include a list of links to other potentially related websites. For example, entering the name of a company into the Google web search engine typically results in a summary of information relevant to that company, which, in addition to providing a simple overview (e.g. a brief description) of the company, might include "enhanced" information relating to key personnel, subsidiaries, contact details, current stock price, links to related websites, and so on.
The "enhanced" information output in response to a search query is, however, generic in the sense that it is not tailored to a specific task being performed by the user of the web browser, but rather contained to the search session without consideration for the overall task. The present invention provides an improved system and method for the capture and retrieval of information relevant to a task being performed by a user of the web browser.
Aspects and embodiments of the present invention are set out in the appended claims. These and other aspects and embodiments of the invention are also described herein.
According to at least one aspect described herein, there is provided a method, in a data processing system comprising a processor and a memory, for automated retrieval of stored digital information during a user-performed task, comprising: receiving, by the data processing system, digital information accessed by a user during a current task; classifying, by the processor, the current task based on at least one of: current digital information and previous digital information received in relation to the user; comparing, by the processor, the current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the user; determining, by the processor, whether any of the identified stored previous tasks contain entities and/or relations corresponding to the current task; and upon positive determination, by the processor, providing the user with digital information extracted from the one or more identified stored previous tasks.
According to another aspect, there is provided a method for automated retrieval of stored digital information during a user-performed task, the method comprising: receiving digital information accessed by a user during a task; determining at least one entity based on said received digital information; receiving further digital information accessed by a user during a task; determining a property of said entity based on said further digital information; collating digital information associated with said entity; and providing said collated digital information to the user.
According to another aspect, there is provided a computer-implemented method of processing information during a user-performed task, the method comprising: extracting information from at least one information source accessed by a user during a task; identifying at least one of an entity and a property associated with an entity from said extracted information; associating the identified at least one of an entity and a property associated with an entity with a stored database of entities and properties thereby to update the database; in response to a user query related to a particular entity, extracting information relevant to the particular entity from the database; and providing said information relevant to the particular entity to the user.
In such a way, useful data can be presented to a user automatically while they are undertaking a task. Optionally, a task comprises an information gathering task for a particular purpose, using at least one information source. Optionally, the at least one information source is accessed by the user via a network connection, such as via the Internet. Optionally, information is extracted automatically from the at least one accessed information source. Optionally, said property comprises further information about said determined entity. Optionally, said further information about said determined entity comprises: a location, contact details, a skill, a role, a sector, an investment, or a document. Optionally, said property comprises an entity related to said determined entity; optionally said related entity comprises: a company, a person, a social media profile, a product, or a project.
For accuracy, the method may comprise weighting the related entities and/or properties according to the relevance and/or confidence associated with said entities and/or properties.
Optionally, the digital information relates to at least one webpage. For interoperability and/or ease of use, the information being retrieved relating to the webpage may be HTML content; optionally the information further comprises the webpage URL; optionally, the information further comprises actions taken by the user while viewing the webpage. For accuracy, a sequence of accessed websites may be mapped to a vector, for example thereby creating task/workflow embedding vectors.
Optionally, the method may further comprise comparing a current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the current task and/or user.
So as to provide relevant information, comparing the current task against previous tasks may comprise identifying a primary entity corresponding to the current task, and searching for said primary entity in the stored previous tasks. Optionally, comparing the current task against previous tasks may comprise measuring the statistical similarity of the current task and one or more previous tasks, optionally using a trained classifier. For ease of use / interoperability, the information relating to the webpage may be retrieved by a (for example, plug-in) extension to the web browser and sent to the data processing system. For security, each user may be identified by an anonymous encrypted key.
Optionally, the method may further comprise converting received digital information into a predetermined ontology. Optionally, the digital information is received as one or more first class objects. Optionally, receiving digital information comprises identifying entities and/or relations in the information. Optionally, identifying entities and/or relations in the information comprises comparing the information against a predetermined mapping. Optionally, identifying entities and/or relations in the information comprises using Named Entity Recognizers. For accuracy, the method may further comprise allocating at least one of a score and a weighting to said entity identified in the information based on a confidence rating that said entity is accurately identified. Optionally, said further digital information accessed by a user during a task comprises information accessed by a user during the same task as said received digital information.
Optionally, the method further comprises determining an entity representative of said task. Optionally, determining an entity representative of said task comprises determining an entity highly connected to other entities, weighting entities on their relevance to the task, and/or receiving an indication from the user.
Optionally, said further digital information accessed by a user during a task comprises information accessed by a user during a previous task. Optionally, the further digital information is received in relation to the user. Optionally, the further digital information is received in relation to other users.
For accuracy, said previous task may be selected from a number of previous tasks in dependence on the relevance of the previous task to the current task. Optionally, the relevance of said previous task is determined on the basis of a primary entity of said current task being present in the previous task. Optionally, the relevance of said previous task is determined on the basis of connections between primary entities of said tasks. Optionally, the relevance of said previous task is determined on the basis of a measure of the similarity of workflows. Optionally, the relevance of said previous task is determined on the basis of a comparison of the websites visited during each task.
Optionally, the method further comprises classifying said task based on said received digital information and/or said received further digital information.
Optionally the method further comprises predicting, by the processor, user-desired information based on at least one of: current digital information and previous digital information; querying an external data source for external information relevant to the predicted user-desired information; and upon positive determination of external information relevant to the predicted user-desired information, receiving said external information into the memory.
Optionally, the method further comprises determining that a task is underway. Optionally, the method further comprises associating the identified at least one of an entity and a property associated with an entity with the current task. Optionally, the method further comprises associating at least one information source with a particular task. Optionally, the database is a graph database. Optionally, providing said information relevant to the particular entity to the user comprises using a user interface.
According to another aspect, there is provided a user interface, configured to: retrieve digital information from a webpage being accessed by a user performing a task on a web browser; transmit said retrieved digital information to a data processing system; receive stored digital information from the data processing system; and output said received stored digital information to said user; wherein the digital information output to the user is continually updated during performance of the task based on the webpages accessed by the user.
Optionally, for ease of use, the user interface outputs the digital information within the web browser. Optionally, for ease of use, the output is in the form of a user interface element that is configured to display different digital information according to the type of information, for example web-links, email addresses, free text. Optionally, the user interface comprises a web browser extension arranged to communicate digital information with the web browser. Optionally, the means for providing said information relevant to the particular entity to the user comprises a user interface as described herein.
According to another aspect, there is provided a system for automated retrieval of stored digital information during a user-performed task, comprising: means for receiving digital information accessed by a user during a current task; means for classifying, by the processor, the current task based on at least one of: current digital information and previous digital information received in relation to the user; means for comparing the current task against previous tasks stored on the memory having a similar classification to identify whether one or more stored previous tasks relate to the user; means for determining whether any of the identified stored previous tasks contain entities and/or relations corresponding to the current task; and means for providing the user, upon positive determination, with digital information extracted from the one or more identified stored previous tasks.
According to another aspect, there is provided a system for automated retrieval of stored digital information during a user-performed task, the system comprising: means for receiving digital information accessed by a user during a task; means for determining at least one entity based on said received digital information; means for receiving further digital information accessed by a user during a task; means for determining a property of said entity based on said further digital information; means for collating digital information associated with said entity; and means for providing said collated digital information to the user. Optionally, the system comprises a computing device in communication with a data processor, wherein the computing device is configured to capture digital information accessed by a user during a current task and to send the captured information to the data processor. Optionally, the digital information accessed by the user is on a webpage, preferably wherein said digital information comprises at least one of (HTML) content and the URL. Optionally, the computing device is configured to allow the user to access the webpage via a web browser. Optionally, the web browser comprises a (for example, plug-in) extension that is configured to capture the digital information on the web page.
According to another aspect, there is provided a method for classifying user-performed tasks, comprising: receiving a sequence of user accessed websites corresponding to a user performed task; mapping said sequence of user accessed websites to a classification vector; and classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector. According to another aspect, there is provided a computer-implemented method of classifying a sequence of user accessed websites in accordance with user-performed tasks, comprising: receiving a sequence of user accessed websites; and using a trained classifier, classifying sub-sequences of user accessed websites as particular user-performed tasks. In such a way, a user's task can be automatically classified, which can lead to automatically providing the user with information relevant to the task.
Optionally, for accuracy and/or efficiency classifying said task comprises using a trained classifier. Optionally, the trained classifier comprises a recurrent neural network. Optionally, for accuracy, the method further comprises training the classifier using a labelled sequence of website vectors as an input, thereby to build a(n internal) representation of the task.
Optionally, the method further comprises projecting said representation of the task onto said classification vector. Optionally, the method further comprises training the classifier to classify a sequence as belonging to a predefined community, and to assess how well a sequence belongs to a classification. For ease and/or efficiency of calculation, said sequence of user accessed websites is represented as a vector. Optionally, the method further comprises splitting the sequence into at least one sub-sequence of accessed websites. Optionally, said at least one sub-sequence is mapped to a classification vector. For accuracy, the sub-sequences of website vectors may be iteratively broken or joined to reach an optimal classification quality.
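The splitting-and-classifying idea might be sketched as follows, with a toy keyword 'classifier' standing in for the trained model; the class names, keywords, and confidence threshold are assumptions.

```python
# Illustrative sketch of classifying a sub-sequence of visited websites:
# the sub-sequence is mapped to a vector of class probabilities and is
# accepted as a task when the top probability clears a confidence level.
# The keyword 'classifier' is a stand-in for a trained (e.g. recurrent)
# model; classes, keywords, and threshold are all hypothetical.

def classify(subsequence):
    """Return a probability vector over {recruitment, shopping}."""
    hits = sum("linkedin" in url or "jobs" in url for url in subsequence)
    p_recruit = hits / len(subsequence)
    return {"recruitment": p_recruit, "shopping": 1 - p_recruit}

def label(subsequence, confidence=0.6):
    """Classify, but only commit to a task above the confidence level."""
    probs = classify(subsequence)
    task, p = max(probs.items(), key=lambda kv: kv[1])
    return task if p >= confidence else None

task = label(["linkedin.com/in/jsmith", "acme.jobs/roles", "news.example.com"])
```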
Optionally, the method further comprises determining a community of websites, said community comprising one or more webpages relating to a particular category of information. Optionally, the confidence level associated with said classification vector comprises a measure of prediction accuracy; optionally said measure of prediction accuracy comprises the perplexity of said classification vector. Optionally, said classification vector comprises a list of probabilities of the sequence belonging to a specific class. Optionally, the method further comprises determining the start and/or end of said task. According to another aspect, there is provided a system for classifying user-performed tasks, comprising: means for receiving a sequence of user accessed websites corresponding to a user performed task; means for mapping said sequence of user accessed websites to a classification vector; and means for classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
According to another aspect, there is provided a method for predictive searching of databases, comprising: receiving digital information accessed by a user during a task into a memory; determining relevant data related to said task not stored within said memory; retrieving said relevant data from an external data source; and presenting said relevant data to the user; wherein said relevant data is determined in dependence on said data related to said task stored within said memory.
According to another aspect, there is provided a computer-implemented method of predictive searching of at least one information source, comprising: extracting information from at least one information source accessed by a user during a task into a database; using the information in the database, identifying further information that is likely to be of relevance to the task, wherein the further information is not included in the information in the database; extracting the further information from at least one information source into the database; and presenting said further information to the user. In such a way, relevant data can be presented to a user without them having to proactively search for it.
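By way of illustration only, this gap-filling behaviour may be sketched as follows. The field schema, entity name and stub external source are assumptions made purely for the example; a real implementation would scrape websites or query external APIs as described below.

```python
# Illustrative sketch: identify fields missing from the database for an
# entity, fetch them from an (assumed) external source, and return the
# augmented record for presentation to the user.
REQUIRED_FIELDS = {"sector", "location", "website"}  # assumed task schema

def external_lookup(entity, field):
    # stand-in for scraping a website or querying an external API
    stub = {("University of Town", "website"): "https://example.org"}
    return stub.get((entity, field))

def predictive_search(database, entity):
    known = database.get(entity, {})
    missing = REQUIRED_FIELDS - known.keys()    # identify further information of likely relevance
    for field in sorted(missing):
        value = external_lookup(entity, field)  # extract it from an external information source
        if value is not None:
            known[field] = value
    database[entity] = known
    return known                                # present the augmented record to the user

db = {"University of Town": {"sector": "Education", "location": "Europe"}}
record = predictive_search(db, "University of Town")
assert record["website"] == "https://example.org"
```

The user never searches for the missing field explicitly; the gap in the stored record itself drives the retrieval.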
For accuracy, the method may further comprise determining a classification of the task based on said received digital information; wherein said relevant data related to said task not stored within said memory is determined in dependence on said determined task classification. Optionally, determining a classification of the task based on said received digital information comprises: receiving a sequence of user accessed websites corresponding to a user performed task; mapping said sequence of user accessed websites to a classification vector; and classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
Optionally, said relevant data related to said task not stored within said memory is determined in dependence on one or more identified entities in the contents of the memory. Optionally, the method further comprises predicting a future task in dependence on said task and wherein relevant data related to said task not stored within said memory is determined in dependence on said predicted future task.
For accuracy of relevant data, the memory may comprise digital information related to previous tasks and/or tasks performed by other users. Optionally, the digital information related to previous tasks and/or tasks performed by other users is used to determine relevant data related to said task not stored within said memory. Optionally, the digital information accessed by a user during a task comprises information relating to an entity. So that the data is relevant to a primary entity, the method may further comprise identifying a primary entity in the digital information accessed by a user during a task in said memory, wherein the relevant data related to said task not stored within said memory relates to the primary entity. Optionally, retrieving said relevant data from an external data source may comprise querying for data related to the primary entity. For efficiency, retrieving said relevant data from an external data source may comprise scraping a website. For efficiency, retrieving said relevant data from an external data source may comprise querying an external application program interface (API).
For interoperability / ease of use, the method may further comprise mapping said relevant data to the input of said API. For ease of use, presenting said relevant data to the user may comprise compiling the relevant data retrieved from the external data source with digital information accessed by a user during a task on said memory. For ease of use, presenting said relevant data to the user may comprise linking said data to data already in said memory. Optionally said data may be linked to data relating to previous tasks.
According to another aspect, there is provided a system for predictive searching of databases, comprising: means for receiving digital information accessed by a user during a task into a memory; means for determining relevant data related to said task not stored within said memory, said relevant data being determined in dependence on said data related to said task stored within said memory; means for retrieving said relevant data from an external data source; and means for presenting said relevant data to the user. The invention also provides a computer program or a computer program product for carrying out any of the methods described herein, and/or for embodying any of the apparatus features described herein, and a computer readable medium having stored thereon a program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
The invention also provides a signal embodying a computer program or a computer program product for carrying out any of the methods described herein, and/or for embodying any of the apparatus features described herein, a method of transmitting such a signal, and a computer product having an operating system which supports a computer program for carrying out the methods described herein and/or for embodying any of the apparatus features described herein.
Any feature in one aspect of the invention may be applied to other aspects of the invention, in any appropriate combination. In particular, method aspects may be applied to apparatus aspects, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
Furthermore, features implemented in hardware may generally be implemented in software, and vice versa. Any reference to software and hardware features herein should be construed accordingly. The invention extends to a system and method for automating information retrieval substantially as described herein and/or as illustrated in the accompanying figures.
Instead of crawling the web to index data, only the data users discover while performing a task in the web browser is indexed, optionally along with indexing data from previous tasks and/or external data automatically acquired in response to a prediction of the user's desires. Semantic web technologies (e.g. OntoText & Cambridge Semantics) leverage the semantic graph to return conventional search engine results for search terms, rather than attempting to assist with tasks, and do not utilise user behaviour. In contrast, the present invention has no natural language query and retrieves useful information, preferably in the form of one or more first class information entities, rather than URLs. Information is augmented directly into the current webpage being accessed by the user.
Thus, the present invention may be considered to function as a task completion assistant for professionals conducting online research (e.g. knowledge workers such as recruiters, salespeople, investors, academics, analysts, etc.). It saves the researcher time, provides structure to their research task while promoting best practice, and allows organisations / individuals to make the most of the work being done every day by hundreds of millions of workers. In this way, a user may be provided with related information featured on other websites that assists them in their present (e.g. research) task without having to navigate to other websites whereby to collate the information themselves.
Task-oriented products such as Cortana, Google Knowledge Box, for example, do not focus on professional task assistance, nor do they capture and/or represent current work being done across websites/applications. Known technologies provide no ability to customise a search to fulfil a task, nor can the searches be queried against or shared. Sector specific information aggregators such as Entelo, DueDil and FullContact, for example, have no understanding of overall workflow and do not dynamically present information based on information needs.
Task capture software such as ATLAS Recall takes periodic screenshots of a user's computer screen, which may include the user's web browsing content, and uses optical character recognition to store and index this content using conventional search technology. This content is not collated into tasks and no meaning is ascribed to what the user was doing with that content. Further, the only mechanism of retrieving the content is to perform a natural language search. In summary, the present invention assists users with web-based tasks by keeping track of the work they are currently doing, which information is stored in an easily retrievable format. It uses techniques from natural language processing, knowledge representation and reasoning, deep and reinforcement learning, and dynamic/task based information retrieval to break the sequence of pages a user looks at down to the task level, predict intent, and deliver relevant information around this context based on a personalised knowledge graph.
As used herein, the term "task" preferably connotes a discrete activity performed by a user (or multiple users) for a particular purpose. Tasks may include tasks performed by a user online, such as using social profiles to assess a candidate for recruitment, investigating a company for "know your customer" (KYC) purposes, or reading academic research papers, for example. A task may include at least one of the following actions: using a web browser to perform (e.g. Google) searches, reading a webpage, clicking a link, copying and pasting content, completing a form, etc. These actions may occur in a sequence referred to herein as a "workflow".
As used herein, the term "computing device" preferably connotes an electronic device having data input/output capabilities, a processor arranged to run software and a digital display, preferably configured to display said output in graphical form. As used herein, the term "digital information" preferably connotes information that can be managed and retrieved by a computing device, which information is stored (usually electronically) using a series of ones and zeros. As used herein, the term "mechanism" preferably connotes elements of the present invention that perform various operations, functions, and related aspects. As used herein, the term 'first class object' or 'first class entity' preferably connotes an object or entity which supports all mathematical / processing operations. As used herein, the term 'vector' preferably connotes a one dimensional array. As used herein, a vector is preferably a type of first class object. As used herein, the term 'ontology' preferably connotes a description of a domain, where the ontology is made up of a collection of concepts / classes / entities and the properties / relations between such concepts / classes / entities. As used herein, the term 'knowledge base' preferably connotes an ontology augmented with a set of rules that allow patterns in the information provided in the knowledge base to be found. As used herein, the term 'knowledge graph' refers to a knowledge base having data organized as a graph and/or implemented using a graph database. As used herein, the terms 'knowledge graph' and 'knowledge base' may be understood to be interchangeable. As will be understood by a skilled person, web browser extensions (such as may be used with Google Chrome (RTM), for example) are small software programs that can modify and enhance the functionality of webpages in the web browser. They can be written using web technologies such as HTML, JavaScript, and CSS.
The web browser extension described herein serves two main purposes: workflow collection and task card display. A task card provides a user with information about their current workflow and task within the web browser extension, as well as optionally previous workflows and/or information predicted to be relevant to the user's future tasks. Any apparatus feature as described herein may be provided as a method feature, and vice versa. Furthermore, as used herein, means plus function features may be expressed alternatively in terms of their corresponding structure.
In the following description and accompanying drawings, corresponding features of different embodiments are, preferably, identified using corresponding reference numerals. At least one exemplary embodiment of the present invention will now be described with reference to the accompanying figures, in which:
Figure 1 shows an exemplary system according to the present invention;
Figure 2 shows a system overview in more detail;
Figures 3A and 3B show the system architecture;
Figure 3C shows a flow diagram showing the steps of a computer-implemented method of processing information during a user-performed task for use with the system;
Figure 3D shows a flow diagram showing the steps of a computer-implemented method of classifying a sequence of user accessed websites in accordance with user-performed tasks for use with the system;
Figure 3E shows a flow diagram showing the steps of a computer-implemented method of predictive searching of at least one information source for use with the system;
Figure 4 illustrates the knowledge / workflow representation aspect in more detail;
Figure 5 shows an example of domain taxonomy;
Figure 6 shows the taxonomy of Figure 5 with relationships shown;
Figure 7 shows a map of classified websites created for classifying a user workflow;
Figure 8 shows a neural network classifying a user workflow;
Figure 9 illustrates the vectorising of a document;
Figure 10 shows an exemplary task card;
Figure 11 shows a graphical knowledge base that provides information for the task card;
Figure 12 shows an example of a Question Answering mechanism;
Figure 13 shows a history of previous tasks;
Figures 14A and 14B show an example of how relevant information can be captured and presented to a user during a task;
Figures 15A and 15B show an example of a knowledge graph showing previous stored information relating to the task of Figure 14A, and the presentation of stored information;
Figures 16A and 16B show how the stored information illustrated in Figure 15A may be retrieved for automated population of text fields;
Figure 17 shows another example of how relevant information can be captured and presented to a user during a task;
Figure 18 shows an example of the stored information being presented to a user via a mobile computing device;
Figure 19 shows a schematic representation of a present graph;
Figure 20 shows a schematic representation of a past graph;
Figure 21 shows a pipeline for constructing the past entity search;
Figure 22 shows a schematic representation of a future graph;
Figure 23 shows a schematic representation of a super graph;
Figure 24 shows a schematic representation of a data wrapper for importing data from third parties into the system;
Figure 25 shows the flows of information through the system;
Figure 26 shows the schematic operation of a task manager of the system and associated components;
Figure 27 shows the architecture of an integration description language (IDL);
Figure 28 shows a graph of pages visited by a user;
Figure 29 shows a vector transformation of the graph of Figure 28;
Figure 30 shows the architecture of a neural net for predicting the user related to the graph of Figure 28; and
Figure 31 shows schematic hardware components configured to implement the described system.

Figure 1 presents an exemplary system 100 according to the present invention in which a user is accessing the internet via a web browser running on a computing device. A web browser extension, running on the web browser, monitors the webpages accessed by the user, retrieves certain information from the accessed webpages and communicates information relating to the accessed webpages between the web browser and a separate data processing system comprising a processor and a memory. Knowledge workers typically use the internet (or "web") to complete professional tasks, which may involve performing multiple searches, manual information consolidation, and translational effort in moving/sharing completed work between formats, applications, and people. This is inefficient and relies on the time, skill, and memory of the worker.
The present system takes, as input, current information (e.g. HTML content, URL) from a webpage being accessed by a user and, optionally, user actions (e.g. mouse hovers, clicks, drags, etc.). Facts (e.g. entities + relations) are then derived from the current information. The present task being performed by the user is then classified in the context of both the current information and previous information received relating to that user (e.g. company research, candidate research, technical question answering, etc.).
Once the facts are established, the "work" and "knowledge" may be represented in graphical format where tasks, entities, and relationships accessed and viewed by the user are represented as "first class objects" that can be queried and logically reasoned over. The representation of work and knowledge in graphical format facilitates a generic query mechanism. In particular, a body of text (e.g. HTML content) can easily be associated with a piece of work (e.g. a task performed by a user), and thus the knowledge found during that process. The process involves receiving a request from the user and automatically formulating a query against the (knowledge) graph using a translation/query layer.
More specifically, any text (such as a web search query or HTML webpage content) is described using a vector. Likewise, a workflow or task is also described as a vector. The vector represents the 'sentiment' of the text/graph/task (i.e. generalized information related to the text/graph/task that the vector is describing, such as the subject of the text/graph/task, thereby to allow classification to take place). For example, the vector may be "a vector/information about recruitment", or "a vector/information about cinema tickets". As will be described later on, machine learning is used to associate these vectors. This may allow a vector representing a description of a task to be found to be similar to a vector of a graph that describes the information in that task.
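By way of illustration only, the idea of describing text as a vector and comparing vectors may be sketched with a deliberately tiny bag-of-words model; the six-word vocabulary is an assumption made for the example, whereas the described system uses learned embeddings.

```python
import math
import re
from collections import Counter

# Illustrative fixed vocabulary (assumed for the example only).
VOCAB = ["recruitment", "candidate", "cinema", "ticket", "machine", "learning"]

def vectorise(text):
    """Map free text onto the vocabulary as a term-frequency vector."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return [tokens[word] for word in VOCAB]

def cosine_similarity(a, b):
    """Normalised dot product: 1.0 for identical direction, 0.0 for no overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

task_vec = vectorise("Screening a candidate for a recruitment role")
page_vec = vectorise("Candidate profile: recruitment history and references")
off_topic = vectorise("Buy cinema tickets online")

# The on-topic page is closer to the task vector than the off-topic one.
assert cosine_similarity(task_vec, page_vec) > cosine_similarity(task_vec, off_topic)
```

The same comparison applies unchanged whether the vector describes a query, a page, or a whole workflow.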
Figure 2 shows a more detailed overview of the system. A web browser extension 1 may be used to capture digital information (e.g. the content, URL, etc.) from an information source such as a web service (such as a website) 2 accessed by a user performing a task. Captured digital information is sent (γ) to the knowledge extractor 3 and sent (δ) to the workflow extractor 4. The knowledge extractor 3 returns all the entities and relations found within the content of the webpage 2 just captured. The workflow extractor 4 creates or updates user workflows (i.e. a sequence of actions forming a task). Extracted information is stored in graphical form (i.e. a knowledge graph or knowledge base) in a data store (not shown). Extracted knowledge (in particular knowledge that relates to the current workflow) is presented to the user via an output 5, such as a task card which is part of a user-interface. If the user is not currently viewing the task card, the browser extension may push the current task card towards the user (e.g. by displaying the task card as a 'pop-up' on the user interface), for example to notify the user that updated information is available to view. The user may pose questions in (pseudo) natural language about the task card 5 in a query field 6 of the task card 5. A translational parser converts the question into a graph query against the current subgraph. The answer presented to the user consists of any matching data in the data store and is appended to the current task card 5. The user may review the task card of previous workflows 7, and may provide search criteria to filter the list of previous task cards. The information presented to the user is then updated accordingly.
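By way of illustration only, the translational parser may be sketched as a pattern that converts a (pseudo) natural language question into a query over the current subgraph; the question template and the triples in the subgraph are assumptions made for the example.

```python
import re

# Illustrative current subgraph as (tail, relation, head) triples.
subgraph = [
    ("John Smith", "WORKS_AT", "University of Town"),
    ("Jane Doe", "WORKS_AT", "Acme Ltd"),
]

def answer(question):
    """Translate one assumed question template into a graph query and run it."""
    match = re.match(r"who works at (.+)\?", question, re.IGNORECASE)
    if not match:
        return []  # question does not fit the (single, illustrative) template
    target = match.group(1)
    # graph query: ?person -WORKS_AT-> target
    return [tail for tail, rel, head in subgraph
            if rel == "WORKS_AT" and head == target]

assert answer("Who works at University of Town?") == ["John Smith"]
```

A fuller parser would support many templates and relation types, but each one reduces, as here, to a pattern over the graph.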
Information retrieval
As illustrated in Figure 3, the information retrieval mechanism (using the knowledge extractor 3 and workflow extractor 4) comprises three main aspects:
i. Information capture 200 - the retrieval of information from a webpage being accessed by a user, for example the (e.g. HTML) content and URL, preferably together with the actions (e.g. mouse clicks, hovers, etc.) taken by the user when browsing the webpage;
ii. Work and knowledge representation 300 - the task is represented in terms of the knowledge found during performance of the task, which representation may utilise knowledge graphing and/or neural graph embedding of workflows; and
iii. Work assistance 400 - the work representation can be queried in a generic way, which may allow applications to be built that can assist the user, for example.
Figures 3A and 3B show two complementary illustrations of the system 100, where Figure 3A shows a component view and Figure 3B illustrates an implementation. As examples, the knowledge/work representation 300 may be implemented as a knowledge base graph, as in Figure 3, where the links between each user are illustrated. Similarly, the question answering and fact recommendation components of the work assistance 400 may be implemented using an application sidebar on a web page, as in Figure 3B. Work assistance 400 may also or alternatively take the form of an export to a further service, such as Google (RTM) Sheets, a web based customer relationship management (CRM) software, or custom CRM software. The web services 2 used may include web pages, Github (RTM), Lusha (RTM), custom databases, or other third party software.
Figure 3C shows a flow diagram showing the steps of a computer-implemented method 10 of processing information during a user-performed task for use with the system 100, which makes use of the aspects mentioned above. In a first step 12, information is extracted from at least one information source accessed by a user during a task (i.e. the information capture 200 aspect is used). In a second step 14, relevant information (in particular, at least one of an entity and a property associated with an entity) is identified from said extracted information. In a third step 16, the identified relevant information is associated with a stored database of entities and properties (i.e. the knowledge graph) thereby to update the database. It will be appreciated that the second and third steps together make use of the work and knowledge representation 300 aspect. In a fourth step 18, in response to a user query related to a particular entity, information relevant to the particular entity is extracted from the database. In a fifth step 19, said information relevant to the particular entity is provided to the user. It will be appreciated that the fourth and fifth steps together make use of the work assistance 400 aspect described herein.
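By way of illustration only, the five steps of method 10 may be sketched end to end; the regular-expression "identifier" and the dictionary database are stand-ins, assumed for the example, for the NER machinery and knowledge graph described elsewhere.

```python
import re

database = {}  # entity -> set of associated properties (stand-in knowledge graph)

def identify_relevant(text):
    # steps 12/14: toy identifier -- capitalised bigrams as entities,
    # quoted text as properties (illustrative only)
    entities = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text)
    properties = re.findall(r'"([^"]+)"', text)
    return entities, properties

def update_database(entities, properties):
    # step 16: associate the identified information with the stored database
    for entity in entities:
        database.setdefault(entity, set()).update(properties)

def query(entity):
    # steps 18/19: extract and provide information relevant to the entity
    return sorted(database.get(entity, set()))

update_database(*identify_relevant('John Smith works on "Machine Learning" in "Europe"'))
assert query("John Smith") == ["Europe", "Machine Learning"]
```

Each page the user visits feeds the same pipeline, so the database accumulates across the task rather than per page.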
Figure 3D shows a flow diagram showing the steps of a computer-implemented method 20 of classifying a sequence of user accessed websites in accordance with user-performed tasks for use with the system 100. Classifying sequences of websites (or other information sources) accessed by the user may be useful in distinguishing discrete tasks from each other. In a first step 22, a sequence of user accessed websites is received. In a second step 24, sub-sequences of user accessed websites are classified as particular user-performed tasks using a trained classifier.
Figure 3E shows a flow diagram showing the steps of a computer-implemented method 30 of predictive searching of at least one information source for use with the system 100. Predictively searching for information of relevance may improve the utility of a database (for a particular task) by incorporating relevant information without a user's specific input. In a first step 32, information from at least one information source accessed by a user during a task is extracted into a database. In a second step 34, further information that is likely to be of relevance to the task is identified using the information in the database. This further information is not included in the information in the database. In a third step 36, the further information is extracted from at least one information source into the database. In a fourth step 38, the further information is presented to the user. The various aspects of the invention will now be explained in more detail, as follows:

i. Information capture (200)
Starting with the information capture 200 aspect, structured information may be extracted from semi-structured and unstructured information, thus enabling the system to detect relevant information in various different websites in accordance with a predefined ontology (as described in more detail further on in relation to Figures 5 and 6).
A mapping mechanism may be provided for extracting certain information from the HTML structure of certain webpages, by creating a template for a given webpage in which all the different parts of the webpage are tagged such that when a user accesses that webpage the relevant information in those tagged parts can easily be identified and extracted by the mapping mechanism. For example, a web browser extension may be created that allows different HTML objects in a given webpage to be tagged to represent different entity types, and, optionally, also to associate certain relations between the tagged entities. Once a webpage is tagged, the web browser extension may be used as the mapping mechanism that collects the entities and relations from any webpage having the same structure.
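By way of illustration only, the template idea may be sketched using the Python standard library HTML parser; the class names in the template and the entity types are assumptions made for the example, standing in for the tags created via the browser extension.

```python
from html.parser import HTMLParser

# Illustrative template: HTML class attribute -> entity type.
TEMPLATE = {"profile-name": "PERSON", "profile-employer": "COMPANY"}

class TemplateExtractor(HTMLParser):
    """Collect (text, entity type) pairs from any page matching the template."""

    def __init__(self):
        super().__init__()
        self.current_type = None
        self.entities = []

    def handle_starttag(self, tag, attrs):
        # An element is "tagged" if its class appears in the template.
        self.current_type = TEMPLATE.get(dict(attrs).get("class"))

    def handle_data(self, data):
        if self.current_type and data.strip():
            self.entities.append((data.strip(), self.current_type))
            self.current_type = None

page = ('<div class="profile-name">John Smith</div>'
        '<div class="profile-employer">Acme Ltd</div>')
extractor = TemplateExtractor()
extractor.feed(page)
assert extractor.entities == [("John Smith", "PERSON"), ("Acme Ltd", "COMPANY")]
```

Because the template keys on structure rather than content, every page sharing the layout yields correctly typed entities with no further work.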
The mapping mechanism can, however, only be utilized as long as there are webpages which have the exact same HTML template; when a webpage is accessed for which a template does not exist (i.e. a webpage which has not been mapped), then only free text may be extracted from that webpage. To extract entities and relations from the free text, Named Entity Recognizers (NERs) may be utilized. Such NERs are models which find entities based on textual context and patterns in free text. For example, a relation may be extracted if two or more entities are found in the same sentence along with a few other restrictions, such as that the distance between the entities cannot exceed a certain limit.
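By way of illustration only, the same-sentence heuristic may be sketched as follows; the gazetteer stands in for a trained NER model, and its names, the relation label and the distance limit are assumptions made for the example.

```python
# Illustrative stand-in for a trained NER model: a gazetteer of known names.
GAZETTEER = {"John Smith": "PERSON", "Acme Ltd": "COMPANY"}
MAX_TOKEN_DISTANCE = 8  # assumed limit on distance between entities

def find_entities(sentence):
    return [(name, etype, sentence.index(name))
            for name, etype in GAZETTEER.items() if name in sentence]

def extract_relations(text):
    """Propose a relation when two entities co-occur within the token limit."""
    relations = []
    for sentence in text.split("."):
        entities = sorted(find_entities(sentence), key=lambda e: e[2])
        for (a, _, pos_a), (b, _, pos_b) in zip(entities, entities[1:]):
            between = sentence[pos_a + len(a):pos_b]
            if len(between.split()) <= MAX_TOKEN_DISTANCE:
                relations.append((a, "RELATED_TO", b))
    return relations

text = "John Smith is a director at Acme Ltd. Acme Ltd is based in Europe."
assert extract_relations(text) == [("John Smith", "RELATED_TO", "Acme Ltd")]
```

Entities appearing alone in a sentence, or too far apart, yield no relation, mirroring the restrictions described above.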
The extracted entities and relations are all allocated a confidence ranking. After the entities are extracted, an additional step is used to verify the entity. This step could be, for instance, checking that an entity of type "Person" is present in a database of person names. If an entity is verified then there is an increase in the confidence of that entity. The entities and relations are sent to the database along with their entity types and confidences.

ii. Work and knowledge representation (300)
Figure 4 shows the knowledge/workflow representation 300 aspect of Figure 3 in more detail, in particular showing the general conceptual model used to represent users, workflows/tasks and the information in those workflows as a graph. An exemplary summary of the concepts (i.e. the nodes in the graph) and the relationships between them (e.g. the directed edges in the graph) is provided. Each concept is independent from other concepts (except for MANAGER that is also USER). Each node is accompanied by a curly bracket in which are listed common properties that may be tracked for that concept. Where two nodes are connected by an edge, a relationship exists between those concepts. Edges have names (and sometimes also properties); for instance: workflows have an ident, an initial timestamp and a final timestamp; workflows belong to users (which have an ident, an anon_key, a username and a text), relate to a category (which has a text) and include pages (which have an event_ident, a timestamp, a page_ident, a URL, a domain and a title).
Certain actions trigger the creation, update or fetching of instances of these concepts (specific individuals, e.g. a particular user), which retain all the properties and relationships described so far, as will be described further on. Many specific types of entities and relations (shown on the far right of the diagram) exist and are omitted from this picture for brevity.
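By way of illustration only, the conceptual graph of Figure 4 may be sketched as named nodes and named edges; the property names follow the description above, while the identifiers and a plain dictionary standing in for a graph database are assumptions made for the example.

```python
# Illustrative in-memory graph: nodes carry the concept's properties,
# edges are named (BELONGS_TO, INCLUDES, ...) and may carry properties too.
nodes, edges = {}, []

def add_node(ident, label, **props):
    nodes[ident] = {"label": label, **props}

def add_edge(tail, name, head, **props):
    edges.append({"tail": tail, "name": name, "head": head, **props})

add_node("u1", "USER", anon_key="abc123", username="analyst01")
add_node("w1", "WORKFLOW", initial_timestamp=1530000000, final_timestamp=1530000900)
add_node("p1", "PAGE", url="https://example.org/profile", domain="example.org",
         title="Profile")

add_edge("w1", "BELONGS_TO", "u1")
add_edge("w1", "INCLUDES", "p1", event_ident="e1", timestamp=1530000300)

# A query walks the edges: which pages does workflow w1 include?
pages = [e["head"] for e in edges if e["tail"] == "w1" and e["name"] == "INCLUDES"]
assert pages == ["p1"]
```

Representing workflows, users and pages uniformly as first class nodes is what allows the same query mechanism to span all of them.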
Figure 5 shows the many specific types of entities and relationships of Figure 4 organised as a domain taxonomy (or ontology), which in particular shows the entity hierarchy. All of the relations are IS_CHILD_OF relations and represent specializations of entity types, i.e. an ADVISOR is a specialization of a PERSON, which is a specialization of an INDIVIDUAL. Structured extraction finds entities and relations of specified type on webpages in specific domains (e.g. parts of the website), and the mapping mechanism that allows such extraction is organised into a taxonomy of entity types. In such a data structure a child entity is a more specific concept than the parent entity (IS_CHILD_OF relation); it retains all the features of the parent type and possibly adds more. All the entity types inherit the properties "text", "confidence", "surface form" and "source" from the base ENTITY type. A child entity may inherit from different parent entities and, generally, sibling entities are not disjoint (e.g. a PERSON might be an ADVISOR and a DIRECTOR), but disjointness can be made explicit by means of the DISJOINT_WITH relation (e.g. an INDIVIDUAL is either a PERSON or a COMPANY). The entity types shown in Figure 5 are from the recruitment domain, but it will be appreciated that many other domains (having different or adapted entities) may alternatively or additionally be used. Figure 6 shows the taxonomy of Figure 5, showing the conceptual relations that can exist between entities. For example, an ADVISOR has an ADVISES relation to a COMPANY entity. Relations extracted via structured extraction can be applied on the same taxonomy graph to make the relations between entities explicit. If a relationship exists between two entities, all the child entities of the tail entity (the entity from where the relationship goes out) potentially retain the same relationship toward any child entity of the head entity (the entity where the relationship comes in) unless otherwise specified (i.e.
an INDIVIDUAL might INVESTS_IN a COMPANY, but also a COMPANY might INVESTS_IN an ACADEMY because COMPANY is a subclass of INDIVIDUAL and ACADEMY a subclass of COMPANY). Unstructured extraction might find more relationships by parsing the FREE TEXT found in the pages in the given domain; if so, these relationships are added to the network of relationships in the previous picture. It should also be noted that unstructured extraction will eventually build a universal taxonomy of concepts and relations that will bridge over domains. Domains can coexist, too, so depending on the needs of a user, more taxonomies might be merged in a higher taxonomy to address these needs.
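By way of illustration only, relation inheritance over the IS_CHILD_OF taxonomy may be sketched as follows; the type names follow the recruitment-domain example above, while the particular relations declared are assumptions made for the example.

```python
# Illustrative taxonomy: child type -> parent type (IS_CHILD_OF).
IS_CHILD_OF = {
    "PERSON": "INDIVIDUAL", "COMPANY": "INDIVIDUAL",
    "ADVISOR": "PERSON", "ACADEMY": "COMPANY",
}
# Relations declared between types; descendants inherit them.
RELATIONS = {("INDIVIDUAL", "INVESTS_IN", "COMPANY"),
             ("ADVISOR", "ADVISES", "COMPANY")}

def ancestors(entity_type):
    chain = [entity_type]
    while chain[-1] in IS_CHILD_OF:
        chain.append(IS_CHILD_OF[chain[-1]])
    return chain

def relation_holds(tail, name, head):
    # A relation holds if it is declared between any ancestor of the tail
    # type and any ancestor of the head type (absent an explicit override).
    return any((t, name, h) in RELATIONS
               for t in ancestors(tail) for h in ancestors(head))

assert relation_holds("PERSON", "INVESTS_IN", "COMPANY")   # inherited from INDIVIDUAL
assert relation_holds("COMPANY", "INVESTS_IN", "ACADEMY")  # both sides inherited
assert not relation_holds("ACADEMY", "ADVISES", "COMPANY")
```

A fuller implementation would also consult override and DISJOINT_WITH declarations before accepting an inherited relation.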
With regard to the workflow extractor 4, Figure 7 represents a map of websites, in which each accessed website is mapped to a vector (thereby creating document embedding vectors), whereby to create the map of websites for use in determining a user's workflow. A particular task 11 is represented as a path on the map of websites, showing a user's progress between websites. The points indicated on the path (labelled 1, 2, 3, 4) represent a sequence of visited websites. Referring to Figure 8, there is shown a Recurrent Neural Network (RNN) that is used to classify documents, e.g. websites, represented in vector format into a predefined community (e.g. label). The RNN is trained to classify sequences into predetermined communities and to determine how well a sequence belongs to a particular classification (for example by evaluating a confidence level). The sequence of all visited websites is thereby divided into sub-sequences that remain within a community boundary. The communities found may be used as labels for the tasks, as indicated in the 'key' for Figure 7, for example.
In training, the RNN receives an input of labelled sequences of websites (i.e. vectors representing the websites themselves, as well as the order in which the user accesses the websites) to build an internal representation of the vectors. The final hidden state of the RNN is projected onto a "workflow embeddings layer" (i.e. a characterization of a workflow in terms of websites visited). This last layer is then projected onto the classification vector thereby to classify the websites into e.g. classes A, B, C, D... The classification vector is a list of probabilities stating the confidence that the sequence belongs to a specific class. The RNN thereby learns a representation for the entire sequence (the workflow embedding, which can be compared for similarity with other workflow embeddings by using the dot product), and a classification for a particular sequence (according to the predefined communities).
Referring to Figure 9, there is shown the process of vectorising and classifying websites. Websites (documents and text) are transformed from HTML into vector representations, which are passed to the RNN and used to classify the websites related to the vector representations.
The incoming sequences are provided continuously (i.e. a continuous input of websites visited is fed into the RNN). The RNN breaks sequences into subsequences and/or joins subsequences as appropriate in order to improve classification quality. An output of the RNN is a classification vector for a subsequence and/or a particular website. The perplexity of the classification vector can be taken as a measure of the classification quality.
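The perplexity measure mentioned above can be computed directly from the classification vector; a small sketch follows, assuming the vector is a normalised probability distribution over classes:

```python
import math

def perplexity(classification_vector):
    """Perplexity of a probability distribution over classes.

    A value of 1.0 means the classifier is certain; a value of K (the number
    of classes) means it is maximally unsure. Lower perplexity therefore
    indicates better classification quality.
    """
    entropy = -sum(p * math.log(p) for p in classification_vector if p > 0)
    return math.exp(entropy)
```

For example, a confident vector such as `[0.9, 0.05, 0.05]` yields a lower perplexity than a near-uniform one such as `[0.4, 0.3, 0.3]`, so a rising perplexity can signal that the current subsequence should be broken or re-joined.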
The detection of a start and/or end of a workflow may trigger further actions in the system. For example, when a workflow is ended and/or a different workflow started, the task card/knowledge base for the recently ended workflow may be completed and processed for later retrieval.

iii. Work assistance (400)
Figure 10 shows one possible output of the system for work assistance, which is in the form of a "task card" user-interface providing information retrieved from the data processing system that relates to a current task being performed on a web browser by the user. The task card is part of the web browser extension, which retrieves information from webpages accessed by the user and transmits that information to the data processing system. The information presented in the example shown represents a Company Research task performed in respect of 'University of Town'. The task card provides information on persons of interest related to University of Town, information on their expertise, location information, linked organizations, and skills.
The task card in Figure 10 provides the following information:
• What the task was about (University of Town in this case)
• What type of task it was (Company research)
• The most important information items in the task categorised by type (John Smith as a person, Machine Learning as a sector, Europe as a location, etc.)
• Sources of data from which information items are acquired (e.g. logos of websites for which a mapping mechanism exists) (not shown in Figure 10).
• An indication of one or more prominent missing information items (e.g. when an email address is not found, a message 'email address not yet found' may be displayed) (not shown in Figure 10).
The user is also able to add notes to the task card manually or highlight text on any page and add it to the task card.
A screenshot of an instance of the taxonomy populated with the data from a workflow (i.e. Company Research) about University of Town is shown in Figure 11. This interface allows a user to view a representation of the relevant knowledge graph, and therefore may be referred to as a 'graph viewer'. This knowledge graph shows the complexity of numerous entities and relationships identified and captured while performing the workflow. Every information entity extracted from each webpage is shown, together with all of the relations between each entity, the page and each other. This is a simplified visualization of the knowledge base graph structure. Entity types (e.g. person, company, location, etc.) can be filtered in this view. Also, every entity has a 'score' that represents the confidence that it is accurate, its connectedness in the workflow and how prevalent it is across all workflows. Entities can also be filtered by this score. The complexity of workflows is clearly demonstrated by the graph viewer.
The knowledge base can be interrogated by means of an open interface in pseudo-natural language that assists the user in building the questions that they want to ask the system, as shown in Figure 12. The information retrieval mechanism is triggered (1) when a certain keyword is typed into a search query input box (e.g. "what", "which", etc.) (2). The mechanism reads all the entity types (e.g. "artefact", "company" and "individual") from the taxonomy and populates a (preferably, pop-up) menu of options (3) from which the user can select the main concept of the question. The mechanism then retrieves all the relations and entity types that can be reached from the current entity and organises them in a menu (4) from where the user can select how to continue the question. The previous target entity ("invests in academy") then becomes the current entity (6). This step can be executed zero or more times.
In addition to the list of relations and entities reachable from the current entity, the menu also contains a special item "with text... " (5) after which the user can specify the text associated with the entity instance of interest (e.g. "University of Town") (6).
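The menu-population step described above might be sketched as follows, assuming the taxonomy is held as a simple mapping from entity type to reachable relations; the dictionary layout and example relations are illustrative:

```python
def menu_options(taxonomy, current_type):
    """List the continuations offered to the user from the current entity type.

    taxonomy: dict mapping an entity type to {relation_name: target_type}.
    Returns "relation target" items plus the special "with text..." item.
    """
    reachable = taxonomy.get(current_type, {})
    options = [f"{relation} {target}"
               for relation, target in sorted(reachable.items())]
    # The special item always appears, letting the user bind the entity
    # instance of interest (e.g. "University of Town") to the question.
    return options + ["with text..."]
```

In the full system, items already exhausted by the data in the current search space would additionally be removed from the returned list, as noted above.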
When the user eventually types a special terminator (e.g. "?") or simply hits the "return" key, the question built so far is passed as input to a parser that utilises a bespoke translational grammar built automatically from the taxonomy to convert it into a database (DB) query. If the parser detects a semantic error, it prompts possible corrections to the user and waits for further input; otherwise, the mechanism formulates answers as output, as described further on. It may be noted that the menu is also optimised by looking at the data currently available in the search space; if any specific relation/individual couple is available from the current position, its item is removed from the menu.
The concepts in which the entities gathered by a user are organised are also used to index the instances in those classes for retrieval purposes. The search might be bounded to the user's current unit of work, the user's past work, or all the past work done by the user's team. The data that match the search criteria are sorted by team (if applicable), unit of work (if applicable) and ranked by centrality/TF-IDF. The most relevant data (if any) is added to the output (e.g. task-card). If the graph does not yet contain an answer, a placeholder is added to the output, which will be automatically replaced by the answer when it becomes available.
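The sorting described above could be sketched as follows, assuming each match carries `team`, `workflow` and `score` fields (the field names and record layout are illustrative assumptions):

```python
def rank_results(matches, current_team, current_workflow):
    """Order matching data for output on the task card.

    Matches from the same team sort first, then matches from the same unit
    of work, then higher centrality/TF-IDF scores; Python compares the key
    tuples element by element, and reverse=True makes all three descending.
    """
    return sorted(
        matches,
        key=lambda m: (m["team"] == current_team,
                       m["workflow"] == current_workflow,
                       m["score"]),
        reverse=True,
    )
```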
Figure 13 shows a dashboard user-interface comprising a collection of task cards (representing different or overlapping knowledge graphs) that a user has accumulated while using the system. A timeline shows the websites and searches made during the workflow. Users may organise their task cards into folders / projects / task types. A search interface allows a user to find a specific task card using key words.
The dashboard may also be used to access knowledge graphs corresponding to task cards. All created knowledge graphs are stored in a database for retrieval at a later date. Figures 14A and 14B illustrate how information may be captured during a user-performed task, and that information used to enhance the task. In the example shown, the user has accessed a particular webpage that lists certain information about the person who is the subject of that particular workflow (e.g. a research task for potential recruitment). In Figure 14A, it can be seen that the system (e.g. the web browser extension) has previously identified and captured the job title (CTO) and company (Acme) relating to the subject person, which information is now presented in the task card. The user has identified that it wishes to capture information relating to the subject person's education, and has therefore entered the term "school" into the search field presented in the task card. In Figure 14B it can be seen that the system has identified from the webpage accessed by the user that the subject person attended the "Ocean Uni", which information has been captured and stored in the knowledge base that has been created for the subject person. Furthermore, the system has returned that information to the task card, which information is now displayed to the user.
Figure 15A shows a knowledge graph comprising the captured information on the various entities and relations relevant to the subject person of Figures 14A and 14B. It can be seen that there are three main entities, about which the other entities are interconnected. The task card in Figure 15B provides a convenient user-interface for a user to be presented with information, which may be previous information stored from a previous task that is retrieved if it is relevant to a current task being performed by the user. Figures 16A and 16B show how the stored information that is captured during a user-performed task may conveniently be used to autofill text fields. In this example, the system is being employed in a recruitment application. In Figure 16A, the system has identified information from the webpage accessed by the user, where the information relates to the required fields, and has retrieved the required information and presented it in the task card. In Figure 16B, an autofill function on the task card has been used to complete the required fields in the form using the information presented in the task card.
Figure 17 shows another example of the system capturing information from a webpage accessed by the user, this time the information being a telephone number captured from an email and sent to the server, in addition to being presented in the (updated) task card.
Figure 18 shows an example of a mobile computing device on which a user has accessed the task card of Figure 17. The mobile computing device is also a mobile telephone device, which is now presenting the user with the option to reach the subject person by calling the telephone number previously captured from the email (in Figure 17).

Other possible applications of the system (i.e. of the work assistance 400 aspect) include (but are not limited to) the following:
• Question Answering on previous work (e.g. "who was the founder of the company researched yesterday?")
• Task/Work management to manipulate and organise the work being captured (e.g. being notified of the previous work performed by a user, or a colleague for example, when starting a similar task or event)
• Fact recommendation to assist with task completion (e.g. auto-filling the information found into a task into a form, email, content management system/database, report, etc.)
• Action recommendation to suggest actions based on the current work (e.g. send an email to a person being researched)
• Service Arbitrage based on our understanding of the task being performed and the information present; potentially relevant internal or third party services can be suggested (e.g. while researching Microsoft, a user could find its stock price on a third party financial information provider). Optionally, the system may be arranged to communicate with a third party service thereby to receive input data and/or provide output data.
Related-entity search/Predictive search engine
Another aspect of the invention is the provision of a related-entity search / predictive search engine. As previously described, as information is captured throughout a user's workflow, a knowledge graph is built by the system, consisting of entities and relations from at least one web page. This knowledge graph represents a summary of the "present" task and information need, and so may be referred to as a 'present graph'.
Figure 19 shows a schematic representation of a present graph 170, where the nodes represent entities. Within the current workflow, a primary entity 171 (or entities) can be determined. This is an entity representative of the whole task and given relative importance amongst the other entities. The primary entity can be determined by one or more of:
• Finding entities in Google search queries within workflows
• Looking for highly connected and weighted entities in the workflow
• Receiving an indication from the user (for instance by clicking on the entity, typing it on a webpage etc.)
In order to provide further information relevant to a user's current task, the user's and/or the user's team's previous workflows may be accessed. For example, with reference to the example shown in Figures 10-18, a user may have done some research on James Smith, and then a week later they do some research on his employer, ACME. While researching ACME, it would be helpful for the user to be reminded of what they previously learned about James Smith and also to see how it links to the current information about ACME.
Figure 20 shows a schematic representation of a 'past graph' 180 made up of knowledge graphs from several workflows having a common entity 171. The determined primary entity is used to find workflow graphs in the user's workflow history. The user's workflow history is queried to determine whether the primary entity is present. If the primary entity is present in any of the past workflows, the graphs for those workflows are retrieved. These graphs are then aggregated with the present graph to form a wider graph, which may be referred to as the 'past graph'. The past graph contains all of the entities that have previously been found to have some association with the primary entity. Alternatively, methods other than using a primary entity may be used to determine related workflows: for example, a measure of the similarity of workflows may be used, or the websites contained in the workflows may be compared.
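The aggregation step might look like the following sketch, representing each graph as a set of (subject, relation, object) triples, which is an assumed simplification of the knowledge base structure:

```python
def build_past_graph(present_edges, workflow_history, primary_entity):
    """Aggregate the present graph with every past workflow graph that
    mentions the primary entity.

    present_edges: set of (subject, relation, object) triples for the
    current task.
    workflow_history: dict mapping a workflow id to its set of triples.
    Returns the union forming the 'past graph'.
    """
    past_graph = set(present_edges)
    for edges in workflow_history.values():
        # Retrieve a past workflow only if the primary entity appears in it.
        if any(primary_entity in (subj, obj) for subj, _, obj in edges):
            past_graph |= edges
    return past_graph
```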
Figure 21 shows a pipeline for constructing the past entity search using the components of the system described with reference to Figure 2. The pipeline consists of the following steps:
i. Extract entities from the current webpage 2 using the knowledge extractor 3 and place them into the current workflow (or present graph).
ii. If there is a primary entity 171 (such as a Google search query), extract it and extract any entities from it. In the example shown in Figure 21, the primary entity is 'John Smith'.
iii. Using the primary entity, search the user's knowledge graph of previous workflows and find a list of all workflows the user has previously created containing the primary entity. Combine this information into a past graph.
iv. Once we know which workflows the primary entity has belonged to in the past, use the RNN vector space of the workflow extractor 4 (as previously described with reference to the 'workflow extraction' aspects) to measure the similarity of each of those workflows with the current workflow.
v. At this point, a list of entities is found from the present and past graphs. The entities are then weighted according to a number of criteria:
• The weight of the primary entity within the present graph
• The confidence that the system understands what the primary entity is
• The weight of the primary entity in the past graph workflows
• The weight of every other entity within its workflow graph
• The similarity of each historical workflow with the current workflow
• Optionally, the time since each historical workflow has taken place
vi. In combination, these weights provide entities that are relevant to the current task, related to the primary entity, have a relatively high degree of confidence, and are contextually relevant to the user. The entities are ranked using this combined weight and then the highest-ranking entities are returned to the user, for example via the browser extension 5. For example, the browser extension may display a message indicating that a user's colleague has performed research about the primary entity or about another primary entity within the last week.
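One way to combine the weighting criteria above is a simple multiplicative score, sketched below; the multiplicative combination, field names, and uniform recency factor are assumptions, and the actual weighting scheme may differ:

```python
def rank_entities(entities, primary_weight, primary_confidence, recency_decay=1.0):
    """Rank candidate entities by a combined relevance weight.

    entities: list of dicts with 'name', 'weight' (the entity's weight within
    its workflow graph) and 'similarity' (of that workflow to the current one).
    primary_weight / primary_confidence: weight of and confidence in the
    primary entity; recency_decay optionally discounts older workflows.
    Returns entity names, highest combined weight first.
    """
    scored = [(e["weight"] * e["similarity"]
               * primary_weight * primary_confidence * recency_decay,
               e["name"])
              for e in entities]
    return [name for _, name in sorted(scored, reverse=True)]
```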
In an example, the past graph may be used to provide USER-based or TEAM-based recommendations. Given current WORKFLOW, the system identifies the CATEGORY and the ENTITIES that are closest to the primary ENTITY of the WORKFLOW. The system finds other WORKFLOWS (from the user, or from TEAM members) that PERTAIN to the same CATEGORY and/or are about any of the close ENTITIES identified above. This information is ranked by the number of connections to the initial WORKFLOW. The information ordered by relevance may be displayed on a side panel to give the user a selection of data with which to complete the current task or to suggest new actions.
Figure 22 shows a schematic representation of a 'future graph' 200 made up of entities found from 3rd party data sources such as the web, an API or a database. Such data sources may be 'scraped' (i.e. data is extracted from human-readable output) to acquire the entities for use in a future graph, as will be described later on. As with the past graph, the primary entity is used as a way to find entities for the future graph, or other mechanisms are used depending on the data source and the information in the current graph. Data is acquired from third party sources using a service layer (which may also be used to provide context-relevant information for work assistance, as previously described) configured to access such sources. Example data sources include GitHub (RTM) and Google (RTM) Docs.
To summarise the functionality of the future graph, entities from the user's current task are used to determine what the user will search for next, those searches are actively performed on their behalf, and relevant information is found, collated, and served to the user. The future graph thereby acts as a kind of predictive search engine, requiring no active user input.
This 'predictive searching' capability is accomplished by initially understanding what the user's current task is, and hypothesising what the next task will be. For instance, if the user's current search query has already been satisfied, the next query can be estimated. The entities in the current and previous workflows can be used to make an accurate guess, and the current task type can be used to guess the next step. For example, if the user is performing a recruitment task then it is likely that the user will want information from LinkedIn (RTM) next. As the system knows what information the user has found so far and how it connects to the user's history of tasks, it can see which pieces of information are the most important in this task (using the past search functionality described above) and use that as a basis to search over LinkedIn for more information.
Once the type of task and the most important entities are determined, a pre-emptive search can be performed. At least three types of data source will be used:
a. API - Many web services have an API. They usually take some input and return data from that service. The most relevant entities from a user's workflow are extracted, matched to the inputs of the service, and data is retrieved from the service.
b. CRM (Customer Relationship Management)/Database - In this case, the relevant entities are transformed into an appropriate structured query which is sent to the service.
c. Scraped website - We can also inform the user's browser to load up the webpage we believe they're going to navigate to next and then automatically scrape the content and extract entities from it in the same way we currently do.

Once data (which is usually already structured) is received from the 3rd party source, it is converted into entities on our ontology (as previously described) and added to the future graph.
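The conversion of structured third-party data into ontology entities might be sketched as follows, with a hand-written per-service field mapping; the mapping, field names, and relation names are illustrative assumptions:

```python
def to_ontology_triples(record, field_map, subject):
    """Convert a structured third-party record into ontology triples.

    record: a dict as returned by a third-party API (already structured).
    field_map: third-party field name -> ontology relation name, one
    hand-written mapping per service.
    subject: the entity the record is about (e.g. the primary entity).
    Fields absent from the mapping or the record are ignored.
    """
    return [(subject, relation, record[field])
            for field, relation in field_map.items()
            if field in record]
```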
Figure 23 shows a schematic representation of a 'super graph' 210. The previously described graphs are combined to form a super graph, which is made up of all of the information related to the current task (and primary entity). By incorporating the previously mentioned graphs, the information in the super graph comes from the current tasks, tasks that the user has done in the past, and information they may wish to find in the future. In theory, this graph should contain all of the information that the user will need given the context of their current task. The super graph is constructed so that we can perform entity search. A super graph may contain thousands of entities, many of which are only tangentially related to the task. To resolve this, the entities are weighted and ranked so that we can determine which of the entities are contextually the most related to the current task (and/or primary entity) and use them accordingly.
Connecting this data to the super graph allows the entities from the future graph that are relevant to the user at the present moment to be identified (allowing them to be presented to the user contextually), and also helps to disambiguate entities and remove noise or errors. These new entities will also be weighted and can be compared and ranked against those already in the graph. This means that the user will be presented with a set of entities from their current workflow, past workflows and future workflow, in one setting. In general, the aim of the described present, past, future, and super graphs is to serve this information to the user in different ways, for example, by displaying the entities, helping the user to complete documents/emails, aiding in web browsing, etc. Importantly, such information is presented contextually (i.e. in response to the user's activities, without a specific user input). The information may also be accessed in response to a specific user request, for example via a question in natural language format (as described earlier). The information is generally presented as part of a task card (as previously described) or other user interface element.
Third party data source wrapper
Figure 24 shows a schematic representation of a data wrapper for importing data from third parties into the system. The wrapper comprises an input script to allow the system to query a third party API based on identified relevant entities in the knowledge base. The third party API may then export data to the system via an output script of the wrapper, thereby adding new entities into a knowledge base. The wrapper is thereby a translation layer from the context of a user (their task and history) to a service a third party can provide.
Data Dashboard
The taxonomy can guide the automatic collection of statistics, including (but not limited to) the following:
• average workflow length in terms of pages or entity instances
• how many workflows per user or per team
• how many entity instances per domain
• how many entity instances per entity type
• how many relations per couple of entity types or per specific instance
• most frequent entity instances or relations
These statistics can be gathered and displayed in an overview page that demonstrates the size and potential of the data inside the knowledge base. These statistics might also be used to dynamically assess the likelihood that some information is correlated with the current context. These statistics can also be used backwards to guess what kind of information is more likely needed given the current context, which can be used as an input for the predictive searching/future graph aspects as previously described.
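The collection of several of these statistics can be sketched with standard-library counters; the workflow record layout below is an assumed simplification of the knowledge base:

```python
from collections import Counter

def collect_statistics(workflows):
    """Gather taxonomy-guided statistics over a set of workflows.

    workflows: list of dicts with 'user', 'pages' (number of pages visited)
    and 'entities' (list of (entity_type, instance_text) pairs).
    """
    return {
        "average_workflow_length":
            sum(w["pages"] for w in workflows) / len(workflows),
        "workflows_per_user":
            Counter(w["user"] for w in workflows),
        "instances_per_entity_type":
            Counter(t for w in workflows for t, _ in w["entities"]),
        "most_frequent_instances":
            Counter(i for w in workflows for _, i in w["entities"]).most_common(3),
    }
```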
Task Manager and Integration Description Language
Figure 25 shows information flows into and out of the system 100. The core of the system is shown as a "service system" 261. At the user side, the service system receives user actions and a stream of events and documents, and provides an assistive response. The system 100 supplies document text to an information extraction component 262 and receives corresponding extracted information. Such information, together with detected user events, may be supplied to a task manager 263, which identifies tasks (as previously described) and creates "task slots", as will be described. The service system 261 is also configured automatically to receive input from third parties 266 via the internet 265 and a service integration component 264 using its "predictive searching" capabilities, as previously described. The system 100 described herein is accordingly capable of operating as a Service Arbitrage layer between the user and third party services they already use or can use. In this way, the system automatically manages queries to relevant 3rd party services that it estimates will help resolve the user's intent.
In more detail, the system 100 is arranged to examine both the user's document set and actions taken to determine which, if any, tasks can be understood as such. If the combination of documents and actions being undertaken can be understood by the system as a task then it can begin servicing the user's intent. It does so by translating that task information into requests that can be understood by third parties.
Ultimately, the system 100 provides an assistive response to the user that can be, but is not limited to, a combination of information the system has organised/inferred, information from third parties, and actions that can be taken on third parties through the system.
At the core of the system service 261 is the Task Manager 263. It is the component of the system that decides task boundaries using all information and signals available. This includes the semantic information extracted by the information extraction 262 from documents/websites and both the implicit and explicit actions the user takes when interacting with that information. If this combination of information and interactions can be understood by the system then the task manager 263 creates a task slot. Task slots represent defined tasks that the system can service requests for (e.g. a recruitment task about John Smith). Once it has been determined that a serviceable task is underway, the system is able to act on both internal data and that from third party data providers 266. In order to connect to a third party, a service integration 264 component is used which connects to the third party service providers 266 via the internet 265. More specifically, a Service Data Transformation is created. This implies the ability to transform data between the internal representation of a task and the format required by the third party. It also implies a degree of Service Discovery, the capability to determine which of the services a given third party exposes can be used for the task at hand (e.g. the GitHub service is queried if the user requires programming information).
Referring to Figure 26, the operation of the task manager 263 (and associated components) is shown in more detail. With reference to the numerals within Figure 26, the processes implemented, and the flows of data, comprise:
i. An incoming data stream (documents + events) is transmitted from users to a load balancer 251.
ii. Requests are load balanced and then transmitted to an interface service pool 252.
iii. A web-socket channel is created and stored in a first database 253.
iv. A tracking event job is created and stored on a second database 254.
v. Each job is queued for extraction processing on the task queues 255.
vi. The job is transferred to the extraction consumer pool 256.
vii. Extraction workers use monolithic extraction services (such as NLTK, Spacy, and Gregory) 257.
viii. Extractor services return entities to the extraction consumer pool 256.
ix. Extractor worker stores triples in a third (graph) database 258.
x. Each job is queued for task extraction.
xi. Task Extractors pull each job from the task queues 255 and determine serviceability on the task extraction consumer pool 259.
xii. Task Extractors transform task data and query third parties 260.
xiii. Third party responses are collected and transformed to triple format on the task extraction consumer pool 259.
xiv. Assistive response is returned to the user from the load balancer 251.
Integration Description Language
To standardise the declaration of format transformations and service discovery there is provided an integration description language (IDL) 270 for all services. The IDL programmatically describes how a third party's data can be transformed to an internal format and vice versa. It also allows the suitability of a given service endpoint in resolving a task slot to be determined. The overall class architecture is shown in Figure 27.
The IDL comprises: an IntegrationClient 271; a ServiceSubscription 272; a ServiceSubscriptionFactory 273; a DataService 274; a ServiceTransform 275; a DataRepository 276; a TokenBearerServiceSubscription 277; and an OauthServiceSubscription 278. These classes are arranged to communicate as illustrated in Figure 27. The operation of each component is further described below.

IntegrationClient 271: In order to arbitrate requests on a user's behalf, services often require brokers to identify/authenticate themselves during service requests. To do so, credentials are provided to the system and transmitted when making requests. Such credentials, along with general configuration, are stored in this class.
ServiceSubscription 272: In order to arbitrate requests on a user's behalf, users are required to authenticate themselves through a given service. OAuth 278 and token 277 authentication are the two primary ways of doing so, and each results in a second set of credentials that the system can use during requests on behalf of the user. The concrete inheriting classes store these secondary credentials.

ServiceSubscriptionFactory 273: Mainly used to decouple authenticating logic from service-specific logic and to generate an authenticated service for a specific task slot.
DataService 274: This is the implementing interface for all services. It contains all logic for querying a given service. The structure follows the commonplace CRUD (create, read, update, delete) framework, with the repository specific methods being implemented in the DataRepository class 276. Because all services at least implement the read method there is a common interface for querying all sources a user is registered to at runtime.
ServiceTransform 275: This class provides a description of how data for a given service can be transformed to and from the graph. Implementing classes are able to use the ontology to define how the service format can be changed to triple format. This allows for storage on the graph, and thus the deduplication, disambiguation, and reasoning of data coming from third party services.

Service Discovery: The naming of implementing classes of the DataService 274/DataRepository 276 interface allows services to declare what task slots they can fill. For instance, if a given repository, say Companies House, contains data about people, then an implementing class would be called CompaniesHousePeopleRepository. In this way a given repository can be dynamically associated with a task slot.
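The naming convention described can be turned into a simple lookup, sketched here; apart from CompaniesHousePeopleRepository, the class names are illustrative assumptions:

```python
def repositories_for_slot(class_names, slot_type):
    """Find implementing classes that declare they can fill a task slot.

    By the convention above, a class named <Provider><SlotType>Repository
    (e.g. CompaniesHousePeopleRepository) declares it can fill task slots
    of type <SlotType>.
    """
    suffix = slot_type + "Repository"
    # Require a non-empty provider prefix before the suffix.
    return [name for name in class_names
            if name.endswith(suffix) and name != suffix]
```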
In some embodiments, definitions created in ServiceTransform 275 classes are strictly checked. This allows programmatic definitions to be generated or inferred, which significantly reduces development time as suggestions and changes can be made automatically. In some of these embodiments, the system self-adapts over time. In some embodiments, the service discovery mechanism works for a set of pre-defined tasks, but is inflexible when it comes to creating new task slots over time. For this, a more nuanced mechanism of dynamic association may be used.

Fingerprinting
Figure 28 shows a graph of page visits for a user. Each time a user visits pages, their activity is recorded as a graph on the knowledge base 300. This graph of user activity is useful for characterising user behaviour.
Figure 29 shows a vector transformation ('fingerprint') of the graph of Figure 28. The vectors within this transformation preserve the distance between graphs: similar graphs have vectors close to each other. To improve performance, an additional dummy node 291 (with text "TOP") is added to the graph and connected to all the other nodes.
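Adding the dummy "TOP" node can be sketched as follows, representing the activity graph as an adjacency mapping (an assumed representation of the knowledge base graph):

```python
def add_top_node(adjacency):
    """Add a dummy "TOP" node connected to every existing node.

    adjacency: dict mapping a node to the set of its neighbours (undirected).
    The dict is augmented in place and returned; every existing node gains
    "TOP" as a neighbour, and "TOP" is adjacent to all of them.
    """
    existing = list(adjacency)
    adjacency["TOP"] = set(existing)
    for node in existing:
        adjacency[node].add("TOP")
    return adjacency
```

Because "TOP" touches every node, it keeps even sparse activity graphs connected, which is the stated motivation for adding it before computing the fingerprint.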
The similarity between graphs is computed by making a neural net predict the user that generated a specific graph. The architecture is shown in Figure 30. The input to this neural net is the Doc2Vec embeddings and the adjacency matrix of the relevant graph and the output is the logits of the user IDs. The graph embeddings can be found in the last layer of the network.
Computer device
Figure 31 shows an example of a computer device suitable for implementing the system 100 (at least in part). The computer device 1000 comprises a processor in the form of a CPU 1002, a communication interface 1004, a memory 1006, storage 1008, removable storage 1010 and a user interface 1012 coupled to one another by a bus 1014. The user interface 1012 comprises a display 1016 and an input/output device, which in this embodiment is a keyboard 1018 and a mouse 1020. In other embodiments, the input/output device comprises a touchscreen.
The CPU 1002 executes instructions, including instructions stored in the memory 1006, the storage 1008 and/or removable storage 1010. The communication interface 1004 is typically an Ethernet network adaptor coupling the bus 1014 to an Ethernet socket. The Ethernet socket is coupled to a network. The memory 1006 stores instructions and other information for use by the CPU 1002. The memory 1006 is the main memory of the computer device 1000. It usually comprises both Random Access Memory (RAM) and Read Only Memory (ROM). The storage 1008 provides mass storage for the computer device 1000. In different implementations, the storage 1008 is an integral storage device in the form of a hard disk device, a flash memory or some other similar solid state memory device, or an array of such devices. The removable storage 1010 provides auxiliary storage for the computer device 1000. In different implementations, the removable storage 1010 is a storage medium for a removable storage device, such as an optical disk, for example a Digital Versatile Disk (DVD), a portable flash drive or some other similar portable solid state memory device, or an array of such devices. In other embodiments, the removable storage 1010 is remote from the computer device 1000, and comprises a network storage device or a cloud-based storage device.
As mentioned, the system 100 is implemented as a computer program product, which is stored, at different stages, in any one of the memory 1006, the storage 1008 and the removable storage 1010. The storage of the computer program product is non-transitory, except when instructions included in the computer program product are being executed by the CPU 1002, in which case the instructions are sometimes stored temporarily in the CPU 1002 or the memory 1006. It should also be noted that the removable storage 1010 is removable from the computer device 1000, such that the computer program product may be held separately from the computer device 1000 from time to time.
The computer program product may also or alternatively be distributed, such that only certain aspects of the computer program product are stored and/or implemented via the computer device. In a particular example, the user may use the communication interface 1004 to access information sources using the internet, which may be incorporated into a database/graph held in storage. Alternatively, the database/graph may be saved remotely, for example via a "cloud server", in which case the computer device is effectively used as a controller for the system. It will be appreciated that various other computer devices could be used to implement part or all of the system. In a particular example, a user telecommunication device (such as a "smartphone") may be used.
Alternatives and extensions

The system may be arranged to dynamically present newly acquired relevant information to the user, as previously mentioned, and in addition to contextually provide current information in response to the user's current task/workflow. For example, the data fields that a user sees in a task card may dynamically change in response to the user's current task/workflow.
Although the present invention has generally been described with reference to data received and/or collected by the system being text, it will be appreciated that other types of data may also be used - for example, image data.
Although the present invention has generally been described with reference to acquiring data via the internet or via web services, it will be appreciated that other sources of information may also be used, in particular those proprietary to a particular user or organization, such as internal databases.
Although the present invention (and in particular, the related entity search/predictive search engine aspects) has generally been described with reference to a research task, particularly in the field of recruitment, it will be appreciated that the invention may be applied to any field in which a user acquires information via the internet and/or one or more databases. For example, the system may be able to assist a user in baking a cake by capturing information related to various alternative recipes and ingredients, reminding the user about previously researched recipes, and predictively suggesting new recipes.

It will be understood that the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention. For example, any feature in a particular aspect described herein may be applied to another aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects described herein can be implemented and/or supplied and/or used independently.

Claims

1. A computer-implemented method of processing information during a user-performed task, the method comprising:
extracting information from at least one information source accessed by a user during a task;
identifying at least one of an entity and a property associated with an entity from said extracted information;
associating the identified at least one of an entity and a property associated with an entity with a stored database of entities and properties thereby to update the database;
in response to a user query related to a particular entity, extracting information relevant to the particular entity from the database; and
providing said information relevant to the particular entity to the user.
2. A method according to Claim 1, wherein a task comprises an information gathering task for a particular purpose, using at least one information source.
3. A method according to Claim 1 or 2, wherein the at least one information source is accessed by the user via a network connection, such as via the Internet.
4. A method according to any preceding claim, wherein information is extracted automatically from the accessed at least one information source.
5. A method according to any preceding claim, wherein said property comprises information about said identified entity.
6. A method according to Claim 5, wherein said information about said identified entity comprises: a location, contact details, a skill, a role, a sector, an investment, or a document.
7. A method according to any preceding claim, wherein said property comprises an entity related to said identified entity.
8. A method according to Claim 7, wherein said related entity comprises: a company, a person, a social media profile, a product, or a project.
9. A method according to any preceding claim, wherein associating comprises weighting the entities and/or properties according to the relevance and/or confidence associated with said entities and/or properties.
10. A method according to any preceding claim, wherein the at least one information source comprises at least one webpage.
11. A method according to Claim 10, wherein the extracted information from the at least one information source comprises HTML content.
12. A method according to Claim 10 or 11, wherein the extracted information from the at least one information source further comprises the webpage URL.
13. A method according to any of Claims 10 to 12, wherein the extracted information from the at least one information source further comprises actions taken by the user while viewing the webpage.
14. A method according to any of Claims 10 to 12, wherein a sequence of accessed webpages is mapped to a vector, for example thereby creating task workflow embedding vectors.
15. A method according to any preceding claim, further comprising classifying the current task; and comparing a current task against previous tasks stored on the database having a similar classification to identify whether one or more stored previous tasks relate to the current task and/or user.
16. A method according to Claim 15, wherein comparing the current task against previous tasks comprises identifying a primary entity corresponding to the current task, and searching for said primary entity in the stored previous tasks.
17. A method according to Claim 15 or 16, wherein comparing the current task against previous tasks comprises measuring the statistical similarity of the current task and one or more previous tasks, optionally using a trained classifier.
18. A method according to any preceding claim, wherein the information from the at least one information source is extracted by a (for example, plug-in) extension to a web browser and sent to a data processing system.
19. A method according to any preceding claim, wherein each user is identified by an anonymous encrypted key.
20. A method according to any preceding claim, further comprising converting extracted information from the at least one information source in accordance with a predetermined ontology.
21. A method according to any preceding claim, wherein the extracted information from the at least one information source is received as one or more first class objects.
22. A method according to any preceding claim, wherein identifying at least one of an entity and a property associated with an entity comprises comparing the extracted information from the at least one information source against a predetermined mapping.
23. A method according to Claim 22, wherein identifying at least one of an entity and a property associated with an entity comprises using Named Entity Recognizers.
24. A method according to any preceding claim, further comprising allocating at least one of a score and a weighting to said entity identified in the information based on a confidence rating that said entity is accurately identified.
25. A method according to any preceding claim, further comprising identifying at least one entity representative of the current task.
26. A method according to Claim 25, wherein identifying at least one entity representative of the current task comprises identifying at least one entity having a relatively high number of connections to other entities in the database.
27. A method according to Claim 25 or 26, wherein identifying at least one entity representative of the current task comprises weighting entities on their relevance to the task.
28. A method according to any of Claims 25 to 27, wherein identifying at least one entity representative of the current task comprises receiving an indication from the user.
29. A method according to any preceding claim, wherein said information relevant to the particular entity from the database comprises information accessed by a user during a previous task.
30. A method according to Claim 29, wherein the information relevant to the particular entity from the database is received in relation to the user.
31. A method according to Claim 29, wherein the information relevant to the particular entity from the database is received in relation to other users.
32. A method according to any of Claims 29 to 31, wherein said previous task is selected from a number of previous tasks in dependence on the relevance of the previous task to the current task.
33. A method according to Claim 32, wherein the relevance of said previous task is determined on the basis of a primary entity of said current task being present in the previous task.
34. A method according to Claim 32 or 33, wherein the relevance of said previous task is determined on the basis of connections between primary entities of said tasks in the database.
35. A method according to Claim 32, wherein the relevance of said previous task is determined on the basis of a measure of the similarity of workflows.
36. A method according to Claim 32, wherein the relevance of said previous task is determined on the basis of a comparison of the information sources accessed by the user during each task.
37. A method according to any preceding claim, further comprising classifying said task based on said extracted information from at least one information source.
38. A method according to any preceding claim, further comprising predicting, by a processor, user-desired information based on at least one of: current information from at least one information source and previous information from at least one information source; querying an external data source for external information relevant to the predicted user-desired information; and upon positive determination of external information relevant to the predicted user-desired information, receiving said external information into the database.
39. A method according to any preceding claim, further comprising determining that a task is underway.
40. A method according to Claim 39, further comprising associating the identified at least one of an entity and a property associated with an entity with the current task.
41. A method according to any preceding claim, further comprising associating at least one information source with a particular task.
42. A method according to any preceding claim, wherein the database is a graph database.
43. A method according to any preceding claim, wherein providing said information relevant to the particular entity to the user comprises using a user interface.
44. A computer program product comprising software code adapted, when executed on a data processing apparatus, to perform the steps of the method according to any preceding claim.
45. A user interface, configured to:
extract information from a webpage being accessed by a user performing a task on a web browser;
transmit said extracted information to a data processing system;
receive stored information from the data processing system; and
output said received stored information to said user;
wherein the stored information is continually updated during performance of the task based on the webpages accessed by the user.
46. The user interface of Claim 45, wherein the user interface outputs the digital information within the web browser.
47. The user interface of Claim 46, wherein the output is in the form of a user interface element that is configured to display different digital information according to the type of information, for example web-links, email addresses, free text.
48. The user interface of any of Claims 45 to 47, wherein the user interface comprises a web browser extension arranged to communicate digital information with the web browser.
49. A system for processing information during a user-performed task, the system comprising:
means for extracting information from at least one information source accessed by a user during a task;
means for identifying at least one of an entity and a property associated with an entity from said extracted information;
means for associating the identified at least one of an entity and a property associated with an entity with a stored database of entities and properties thereby to update the database;
means for, in response to a user query related to a particular entity, extracting information relevant to the particular entity from the database; and
means for providing said information relevant to the particular entity to the user.
50. A system operable to perform the method of any of Claims 1 to 43, comprising a computing device in communication with a data processor, wherein the computing device is configured to extract information from at least one information source accessed by a user during a current task and to send the captured information to the data processor.
51. The system of Claim 50, wherein the information from at least one information source accessed by the user is on a webpage, preferably wherein said information from at least one information source comprises at least one of (HTML) content and the URL.
52. The system of Claim 51, wherein the computing device is configured to allow the user to access the webpage via a web browser.
53. The system of Claim 52, wherein the web browser comprises a (for example, plug-in) extension that is configured to capture the digital information on the web page.
54. The system of any of Claims 49 to 53, wherein the means for providing said information relevant to the particular entity to the user comprises a user interface according to any of Claims 45 to 48.
55. A computer-implemented method of classifying a sequence of user accessed websites in accordance with user-performed tasks, comprising:
receiving a sequence of user accessed websites; and
using a trained classifier, classifying sub-sequences of user accessed websites as particular user-performed tasks.
56. A method according to Claim 55, wherein classifying sub-sequences of user accessed websites as particular user-performed tasks comprises mapping the sequence of user accessed websites to a classification vector.
57. A method according to Claim 56, wherein classifying sub-sequences of user accessed websites as particular user-performed tasks comprises classifying said user-performed task as a particular task in dependence on a confidence level associated with said classification vector.
58. A method according to Claim 56 or 57, further comprising projecting said representation of the task onto a classification vector.
59. A method according to any of Claims 56 to 58, wherein the confidence level associated with said classification vector comprises a measure of prediction accuracy.
60. A method according to Claim 59, wherein said measure of prediction accuracy comprises the perplexity of said classification vector.
61. A method according to any of Claims 56 to 60, wherein said classification vector comprises a list of probabilities of the sequence belonging to a specific class.
62. A method according to any of Claims 55 to 61, wherein classifying comprises using at least one of: information accessed by a user; and any action the user takes when interacting with that information.
63. A method according to any of Claims 55 to 62, wherein said trained classifier comprises a recurrent neural network.
64. A method according to any of Claims 55 to 63, further comprising training the classifier using a labelled sequence of website vectors as an input, thereby to build a(n internal) representation of the task.
65. A method according to any of Claims 55 to 64, further comprising training the classifier to classify a sequence as belonging to a predefined community, and to determine how well a sequence belongs to a classification.
66. A method according to any of Claims 55 to 65, wherein said sequence of user accessed websites is represented as a vector.
67. A method according to any of Claims 55 to 66, wherein said at least one sub-sequence is mapped to a classification vector.
68. A method according to any of Claims 55 to 67, wherein sub-sequences of website vectors are iteratively broken or joined to reach an optimal classification quality.
69. A method according to any of Claims 55 to 68, further comprising determining a community of websites, said community comprising one or more webpages relating to a particular category of information.
70. A method according to any of Claims 55 to 69, further comprising determining the start and/or end of said task.
71. A computer program product, comprising software code adapted, when executed on a data processing apparatus, to perform the method according to any of Claims 55 to 70.
72. A system for classifying a sequence of user accessed websites in accordance with user-performed tasks, comprising:
means for receiving a sequence of user accessed websites; and
a trained classifier for classifying sub-sequences of user accessed websites as particular user-performed tasks.
73. A computer-implemented method of predictive searching of at least one information source, comprising:
extracting information from at least one information source accessed by a user during a task into a database;
using the information in the database, identifying further information that is likely to be of relevance to the task, wherein the further information is not included in the information in the database;
extracting the further information from at least one information source into the database; and
presenting said further information to the user.
74. A method according to Claim 73, further comprising determining a classification of the task based on said extracted information from at least one information source; wherein said further information is identified in dependence on said determined task classification.
75. A method according to Claim 74, wherein determining a classification of the task based on said received digital information comprises:
receiving a sequence of user accessed websites; and
using a trained classifier, classifying sub-sequences of user accessed websites as particular user-performed tasks.
76. A method according to any of Claims 73 to 75, wherein said further information is identified in dependence on one or more identified entities in the contents of the database.
77. A method according to any of Claims 73 to 76, further comprising predicting a future task in dependence on said task and wherein said further information is identified in dependence on said predicted future task.
78. A method according to any of Claims 73 to 77, wherein the database comprises information related to previous tasks and/or tasks performed by other users.
79. A method according to any of Claims 73 to 78, wherein said information related to previous tasks and/or tasks performed by other users is used to identify further information that is likely to be of relevance to the task.
80. A method according to any of Claims 73 to 79, wherein the information from at least one information source accessed by a user during a task comprises information relating to an entity.
81. A method according to any of Claims 73 to 80, further comprising identifying a primary entity in the information accessed by a user during a task in said database, wherein said further information relates to the primary entity.
82. A method according to Claim 81, wherein extracting the further information comprises querying at least one information source for data related to the primary entity.
83. A method according to any of Claims 73 to 82, wherein extracting the further information comprises scraping a website.
84. A method according to any of Claims 73 to 83, wherein extracting the further information comprises querying an external application program interface (API).
85. A method according to Claim 84, further comprising mapping said further information to the input of said API.
86. A method according to any of Claims 73 to 85, wherein presenting said further information to the user comprises, in the database, associating the further information with the information accessed by a user during a task.
87. A method according to any of Claims 73 to 86, wherein presenting said relevant data to the user comprises, in the database, associating the further information with information relating to the task.
88. A method according to any of Claims 73 to 87, wherein presenting said relevant data to the user comprises, in the database, associating the further information with information relating to one or more previous tasks.
89. A computer program product comprising software code adapted, when executed on a data processing apparatus, to perform all the steps of the method according to any of Claims 73 to 88.
90. A system for predictive searching of at least one information source, comprising: means for extracting information from at least one information source accessed by a user during a task into a database;
means for, using the information in the database, identifying further information that is likely to be of relevance to the task, wherein the further information is not included in the information in the database;
means for extracting the further information from at least one information source into the database; and
means for presenting said further information to the user.
PCT/GB2018/051935 2017-07-07 2018-07-06 Digital information capture and retrieval Ceased WO2019008394A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
GBGB1710995.0A GB201710995D0 (en) 2017-07-07 2017-07-07 Digital information capture and retrieval
GB1710993.5 2017-07-07
GB1710997.6 2017-07-07
GB1710995.0 2017-07-07
GBGB1710993.5A GB201710993D0 (en) 2017-07-07 2017-07-07 Digital information capture and retrieval
GBGB1710997.6A GB201710997D0 (en) 2017-07-07 2017-07-07 Digital information capture and retrieval

Publications (1)

Publication Number Publication Date
WO2019008394A1 (en)

Family

ID=62976081

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2018/051935 Ceased WO2019008394A1 (en) 2017-07-07 2018-07-06 Digital information capture and retrieval

Country Status (1)

Country Link
WO (1) WO2019008394A1 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080109285A1 (en) * 2006-10-26 2008-05-08 Mobile Content Networks, Inc. Techniques for determining relevant advertisements in response to queries
WO2013126808A1 (en) * 2012-02-22 2013-08-29 Google Inc. Related entities
US20150106157A1 (en) * 2013-10-15 2015-04-16 Adobe Systems Incorporated Text extraction module for contextual analysis engine


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Graph database - Wikipedia", 3 July 2017 (2017-07-03), XP055503933, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Graph_database&oldid=788828526> [retrieved on 20180903] *
ANONYMOUS: "Knowledge Graph - Wikipedia", 26 June 2017 (2017-06-26), XP055503980, Retrieved from the Internet <URL:https://en.wikipedia.org/w/index.php?title=Knowledge_Graph&oldid=787595938> [retrieved on 20180903] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20250117429A1 (en) * 2019-10-28 2025-04-10 Feedzai - Consultadoria E Inovação Tecnológica, S.A. Graph search and visualization for fraudulent transaction analysis
CN111651989A (en) * 2020-04-13 2020-09-11 上海明略人工智能(集团)有限公司 Named entity recognition method and device, storage medium and electronic device
CN111651989B (en) * 2020-04-13 2024-04-02 上海明略人工智能(集团)有限公司 Named entity recognition method and device, storage medium and electronic device
CN114186974A (en) * 2021-12-13 2022-03-15 中国人民解放军国防科技大学 A development task association method, device, equipment and medium for multi-model fusion
US12386914B1 (en) * 2024-07-09 2025-08-12 Cosmo Institute Of Industrial Intelligence (Qingdao) Co., Ltd. Construction method of intelligent interaction service system, website intelligent interaction method and device


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18743062

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18743062

Country of ref document: EP

Kind code of ref document: A1