RU2839037C1

RU2839037C1 - Method and system for obtaining vector presentations of data in table taking into account structure of table and its content

Info

Publication number: RU2839037C1
Application number: RU2024125243A
Authority: RU
Inventors: Максим Александрович Волков
Original assignee: Общество с ограниченной ответственностью "Сбер Бизнес Софт"
Filing date: 2024-08-28
Publication date: 2025-04-25

Abstract

FIELD: data processing.

SUBSTANCE: the group of inventions relates to data processing and can be used to obtain vector representations of data in a table based on the structure of the table and its content. The method comprises the following steps: obtaining data that includes text and the table structure, the table being defined as a set consisting of a list of table header cells and a list of table body cells; marking each table body cell with tags characterizing a table identifier, the list of atomic columns to which the cell belongs, and the list of atomic rows to which the cell belongs; supplementing the data of each table body cell with information from the corresponding header cells; tokenizing the text in the table; performing positional coding at the table row level; forming a vector representation for each token in the table by aggregating token vector representations and positional vector representations; generating an attention matrix using the membership of each cell in a column or row of the table; storing the coordinates of the table cell boundaries in the sequence of table tokens; feeding the prepared text and positional vector representations of the tokens and the attention matrix to the base model, which processes them to obtain contextualized vector representations of the tokens; and, using the stored coordinates of the table cell boundaries, applying pooling to obtain a vector representation of a table cell.

EFFECT: faster process of training a language model when working with spreadsheet documents.

6 cl, 5 dwg

Description

[1] This technical solution relates generally to methods for modifying, training, and using language models that work with tables, and specifically to a method and system for obtaining vector representations of data in a table that take into account the table's structure and content.

PRIOR ART

[2] A language model is a probability distribution over sequences of words. Language models derive these probabilities by training on a corpus of text in one or more languages. Because languages can express an effectively unlimited number of valid sentences (so-called digital infinity), language modeling must assign non-zero probabilities to linguistically valid sequences that may never occur in the training data. Several modeling approaches have been developed to address this problem, including Markov chains and neural architectures such as recurrent neural networks and transformers.
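The problem of assigning non-zero probability to unseen sequences can be illustrated with a toy first-order Markov (bigram) model. The sketch below is only an illustration under stated assumptions: whitespace tokenization, add-one (Laplace) smoothing, and the function names `train_bigram` and `sentence_probability` are all chosen here for the example and do not come from the source.

```python
from collections import Counter

def train_bigram(corpus):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    vocab = {t for s in corpus for t in s.split()} | {"<s>", "</s>"}
    return contexts, bigrams, vocab

def sentence_probability(sentence, contexts, bigrams, vocab):
    """Probability of a sentence under a bigram model with add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        prob *= (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
    return prob

corpus = ["the table has rows", "the table has columns"]
contexts, bigrams, vocab = train_bigram(corpus)
p_seen = sentence_probability("the table has rows", contexts, bigrams, vocab)
p_unseen = sentence_probability("rows has the table", contexts, bigrams, vocab)
print(p_seen > p_unseen > 0)
```

The unseen word order still receives a small positive probability, which is the property the paragraph describes; neural models achieve the same effect by generalizing through learned representations rather than through explicit smoothing.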

[3] Language models are useful for many problems in computational linguistics: from early applications in speech recognition, where they prevent the generation of meaningless (that is, unlikely) word sequences, to broader use in machine translation (for example, scoring candidate translations), natural language generation (producing more human-like text), part-of-speech tagging, syntactic parsing, optical character recognition, handwriting recognition, grammar induction, information retrieval, and other applications.

[4] Many architectures are designed for working with documents and operate on visual information, textual information, and information about the position of text on the page (layout), such as:

1. TableNet - an end-to-end deep learning model for table detection and table structure recognition.

2. Donut - a transformer-based encoder-decoder architecture that works without OCR: the encoder takes an image as input, and the decoder takes text and the encoder output as input to generate text.

3. DiT - a transformer architecture for pre-training on document images.

4. UDoP - a model that unifies text, image, and layout in a common space using a single transformer-based model.

5. DocLLM - a lightweight extension of traditional LLMs for working with documents that combines the text query and layout without using a visual encoder.

[5] However, these architectures are not specialized for extracting named entities from tables and therefore require more data and computing power to achieve a comparable result on a specific task.

[6] Existing language models designed for working with tables, such as TaBERT (Yin et al., 2020), TURL (Deng et al., 2020), TAPAS (Herzig et al., 2020), and TabNER (Koleva et al., 2022), require modifications to the standard transformer architecture (Vaswani et al., 2017 - https://arxiv.org/pdf/1706.03762.pdf) and pre-training on a large amount of data in order to obtain vector representations of data in a table that take into account the table's structure and content.

[7] TaBERT adds a vertical attention mechanism to contextualize the column embedding, which requires pre-training (in the known examples, the additional training was performed on web pages with data parsed from HTML). The structure of the text passed to the encoder does not match the standard structure of transformer input: the [SEP] token carries a different meaning than in the base transformer, acting as a separator between table rows rather than a separator between texts whose relationship is to be established.

[8] In TURL, the embedding of a token in the table part is obtained by summing the word vector w, the token type embedding t (header or content), and the positional embedding p: $x^{t} = w + t + p$. This does not match the standard transformer scheme and requires additional pre-training on a large amount of tabular data. Because its attention (visibility) mechanism is based on columns and rows, TURL is intended to work only with flat tables.

[9] In TAPAS, the embedding of a token in the table part is obtained by summing the word vector w, the positional embedding p, the segment embedding s, the column embedding c, the table row embedding r, and the rank embedding rg: $x^{t} = w + p + c + s + r + rg$. This does not match the standard transformer scheme and requires additional retraining on a large amount of tabular data.

[10] In TabNER, the embedding of a token in the table part is obtained by summing the word vector w, the positional embedding pos, and the segment embedding seg: $x^{t} = w + seg + pos$. Here the segment embedding indicates whether the text belongs to the table header or the table content, rather than indicating membership in a semantic part of the input as in the standard transformer; this leads to a distribution shift and makes it impossible to use open models. TabNER also uses the model's output embeddings to extract named entities at the token level, which increases the probability of distinguishing the extracted entities.
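The additive schemes in [8]-[10] share one pattern: each extra term is a lookup into an additional embedding table. The following sketch illustrates that pattern for the three-term sum $x^{t} = w + t + p$ only. It is not the implementation of any of the cited models; the table sizes, the random initialization, and the names `word_emb`, `type_emb`, `position_emb`, and `input_embedding` are assumptions made for the example.

```python
import random

DIM = 4
random.seed(0)

def make_table(num_rows, dim=DIM):
    """A lookup table of randomly initialized embedding vectors."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_rows)]

word_emb = make_table(10)      # w: one vector per vocabulary id
position_emb = make_table(8)   # p: one vector per sequence position
type_emb = make_table(2)       # t: 0 = header token, 1 = content token (extra table)

def input_embedding(token_id, position, token_type):
    """Element-wise sum of the word, token type, and positional vectors."""
    w = word_emb[token_id]
    t = type_emb[token_type]
    p = position_emb[position]
    return [wi + ti + pi for wi, ti, pi in zip(w, t, p)]

x = input_embedding(token_id=3, position=0, token_type=1)
print(len(x))
```

A model pre-trained on plain text has no weights for an extra table such as `type_emb`, or has them with a different meaning, which is why the paragraphs above associate these schemes with additional pre-training.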

DISCLOSURE OF THE INVENTION

[11] This technical solution is aimed at eliminating the shortcomings of existing solutions known from the prior art.

[12] The technical problem addressed by this solution is that model architectures used for working with documents do not use the structural elements of tables when building data representations (embeddings) and solving the target task (classification, NER, question answering, etc.). As a rule, they substitute for the table structure a large amount of domain-specific data used at the pre-training and training stages, which either rules out the use of openly available models for the chosen language or forces the use of large amounts of data and computing power to generalize the knowledge contained in tables. SOTA model architectures that do take table structure into account during training generally modify the architecture of standard models. This makes it impossible to use open models without additional pre-training, which requires a large amount of resources and data; the data must also be pre-processed and filtered to match the required structure, which rules out training on large unstructured datasets.

[13] The main technical result achieved in solving the above problem is faster training of a language model for working with tabular documents. This is achieved by making it possible to use any pre-trained transformer to obtain cell-level representations of a table, which allows the use of both open models trained on large amounts of data and specialized models pre-trained on unstructured data specific to the document domain, without prior preparation and cleaning of the corpus.

[14] An additional technical result achieved in solving the above problem is higher accuracy of the language model on its tasks when working with tabular documents, obtained by exploiting the structured nature of tabular data and by supporting tables with complex structure.

[15] These technical results are achieved by a method for obtaining vector representations of data in a table, taking into account the structure of the table and its content, implemented using a processor and a data storage device and including the following steps:

• obtaining data that includes text and the table structure;

• defining the table as a set consisting of a list of table header cells and a list of table body cells;

• marking each table body cell with tags characterizing the table identifier, the list of atomic columns to which the cell belongs, and the list of atomic rows to which the cell belongs;

• supplementing the data of each table body cell with information from the corresponding header cells;

• tokenizing the text in the table;

• performing positional coding at the table row level;

• forming a vector representation for each token in the table by aggregating the token vector representations and the positional vector representations;

• forming an attention matrix using the membership of each cell in a column or row of the table;

• storing the coordinates of the table cell boundaries in the sequence of table tokens;

• feeding the prepared text and positional vector representations of the tokens and the attention matrix to the base model, which processes them to obtain contextualized vector representations of the tokens;

• using the stored coordinates of the table cell boundaries, applying pooling to obtain a vector representation of a table cell.
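The tagging, boundary-storing, and attention-matrix steps above can be sketched as follows. This is a minimal illustration under explicit assumptions, not the claimed implementation: the visibility rule (two cells attend to each other when they are in the same table and share at least one atomic row or atomic column), the whitespace tokenizer, and the names `BodyCell`, `cells_visible`, and `build_attention_matrix` are all hypothetical. The cell texts are assumed to have already been supplemented with header information.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BodyCell:
    table_id: int        # identifier of the table the cell belongs to
    columns: List[int]   # atomic columns covered by the cell
    rows: List[int]      # atomic rows covered by the cell
    text: str            # cell text, assumed already enriched with header text

def cells_visible(a: BodyCell, b: BodyCell) -> bool:
    """Assumed rule: same table and at least one shared atomic row or column."""
    if a.table_id != b.table_id:
        return False
    return bool(set(a.columns) & set(b.columns)) or bool(set(a.rows) & set(b.rows))

def build_attention_matrix(cells, tokenize=str.split):
    """Return tokens, per-cell (start, end) boundaries, and a 0/1 token-level matrix."""
    tokens, owner, boundaries = [], [], []
    for idx, cell in enumerate(cells):
        cell_tokens = tokenize(cell.text)
        # Store where this cell starts and ends in the token sequence.
        boundaries.append((len(tokens), len(tokens) + len(cell_tokens)))
        tokens.extend(cell_tokens)
        owner.extend([idx] * len(cell_tokens))
    n = len(tokens)
    matrix = [
        [1 if cells_visible(cells[owner[i]], cells[owner[j]]) else 0 for j in range(n)]
        for i in range(n)
    ]
    return tokens, boundaries, matrix

cells = [
    BodyCell(0, [0], [0], "Item: bolt"),
    BodyCell(0, [1], [0], "Price: 10"),
    BodyCell(0, [0], [1], "Item: nut"),
    BodyCell(0, [1], [1], "Price: 5"),
    BodyCell(0, [0, 1], [2], "Total: 15"),  # merged cell spanning both atomic columns
]
tokens, boundaries, matrix = build_attention_matrix(cells)
print(boundaries)
```

Keeping lists of atomic rows and columns per cell, rather than a single row and column index, is what lets a merged cell such as the last one remain visible to every cell in each column it spans.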

[16] In one particular embodiment of the method, the base model is fine-tuned for a specific task.

[17] In another particular embodiment of the method, the base model is pre-trained on unstructured domain text, without requiring any information about tables.

[18] In addition, the claimed technical result is achieved by a system for obtaining vector representations of data in a table, taking into account the structure of the table and its content, comprising:

at least one data processing device;

at least one data storage device;

at least one program, wherein the one or more programs are stored on one or more data storage devices and executed on one or more data processing devices, and wherein the one or more programs ensure the execution of the following steps:

• the base model receives as input the prepared text and positional vector representations of the tokens and the attention matrix, and processes them to obtain contextualized vector representations of the tokens;

[19] In one particular embodiment of the system, the base model is fine-tuned for a specific task.

[20] In another particular embodiment of the system, the base model is pre-trained on unstructured domain text, without requiring any information about tables.

BRIEF DESCRIPTION OF THE DRAWINGS

[21] The features and advantages of the present technical solution will become apparent from the following detailed description and the accompanying drawings, in which:

[22] Fig. 1 illustrates a block diagram of the claimed method.

[23] Fig. 2 illustrates the general scheme of operation of the method.

[24] Fig. 3 illustrates an example of the operation of the method.

[25] Fig. 4 illustrates the attention matrix.

[26] Fig. 5 illustrates a system for implementing the claimed method.

IMPLEMENTATION OF THE INVENTION

[27] The terms and concepts needed to implement this technical solution are described below.

[28] The Transformer is a deep neural network architecture introduced in 2017 by researchers at Google Brain. Like recurrent neural networks (RNNs), transformers are designed to process sequences such as natural language text and to solve tasks such as machine translation and automatic summarization. Unlike RNNs, transformers do not need to process a sequence in order. For example, if the input is text, a transformer does not have to finish processing the beginning of the text before processing its end. As a result, transformers are easier to parallelize than RNNs and can be trained faster. The transformer architecture consists of an encoder and a decoder. The encoder receives as input a vectorized sequence with positional information. The decoder receives as input part of this sequence and the output of the encoder. The encoder and the decoder are each composed of layers. Each encoder layer passes its result to the next layer as that layer's input. Each decoder layer passes its result to the next layer, together with the encoder output, as that layer's input. Each encoder layer consists of a self-attention mechanism (input from the previous layer) and a feedforward neural network (input from the self-attention mechanism). Each decoder layer consists of a self-attention mechanism (input from the previous layer), an attention mechanism over the encoder output (input from the self-attention mechanism and the encoder), and a feedforward neural network (input from the attention mechanism).
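The self-attention mechanism mentioned above, and the way an attention matrix restricts it, can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single head, no learned query, key, or value projections (the input vectors are used directly), and a 0/1 mask in which every token can at least see itself. The function names are chosen for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax; masked scores of -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, mask=None):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            if mask is not None and not mask[i][j]:
                scores.append(float("-inf"))  # token j is hidden from token i
            else:
                dot = sum(a * b for a, b in zip(x[i], x[j]))
                scores.append(dot / math.sqrt(d))
        weights = softmax(scores)
        # Each output vector is a weighted average of the visible input vectors.
        out.append([sum(weights[j] * x[j][k] for j in range(n)) for k in range(d)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity_mask = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(self_attention(x, identity_mask) == x)
```

With the identity mask each token attends only to itself, so the output equals the input; with no mask every output mixes all three vectors. Every position is computed independently of the others, which is the property that makes transformers easy to parallelize.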

[29] An embedding is a vector, represented as an array of numbers, that results from transforming data such as text. The combination of numbers that make up the vector acts as a multidimensional map for measuring similarity.

Using vector representations (embeddings) makes it possible to:

• reduce the dimensionality of data - embeddings represent text queries as numerical vectors, which reduces the dimensionality of the data and speeds up its processing;

• improve search quality - embeddings make it possible to measure the similarity between text queries by the distance between the corresponding vectors, which improves search quality and the relevance of results;

• provide versatility - embeddings can be used for various natural language processing tasks, such as Retrieval Augmented Generation (RAG), text classification, clustering, and others.
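Measuring similarity by the distance between vectors can be illustrated with cosine similarity, one common choice among several. The sketch below is only an example: the three-dimensional vectors are made-up values rather than the output of any model, and the function name is chosen for the illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a query and two candidate texts.
query = [0.9, 0.1, 0.0]
candidates = {
    "invoice total": [0.8, 0.2, 0.1],
    "weather report": [0.0, 0.1, 0.9],
}

# Rank the candidates from most to least similar to the query.
ranked = sorted(candidates, key=lambda name: cosine_similarity(query, candidates[name]), reverse=True)
print(ranked)
```

Ranking candidates by this score is the basic operation behind embedding-based search and retrieval.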

[30] Pooling is an operation used in processing images and other data to reduce the dimensionality of the data and extract the most significant features. There are several kinds of pooling, including max pooling and average pooling.
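Applied to a token sequence, pooling collapses the token vectors inside each stored cell boundary into a single cell vector. The sketch below shows average and max pooling over such spans. It assumes half-open `(start, end)` boundaries and non-empty spans; the name `pool_cells` and the sample numbers are illustrative, not taken from the source.

```python
def pool_cells(token_vectors, boundaries, mode="mean"):
    """Pool the token vectors inside each (start, end) span into one vector per cell."""
    pooled = []
    for start, end in boundaries:
        span = token_vectors[start:end]
        dims = list(zip(*span))  # regroup values dimension by dimension
        if mode == "mean":
            pooled.append([sum(values) / len(values) for values in dims])
        elif mode == "max":
            pooled.append([max(values) for values in dims])
        else:
            raise ValueError(f"unknown pooling mode: {mode}")
    return pooled

# Three contextualized token vectors; the first cell covers tokens 0-1, the second token 2.
token_vectors = [[1.0, 4.0], [3.0, 2.0], [5.0, 6.0]]
boundaries = [(0, 2), (2, 3)]
print(pool_cells(token_vectors, boundaries, "mean"))
print(pool_cells(token_vectors, boundaries, "max"))
```

Average pooling weighs every token in the cell equally, while max pooling keeps the strongest activation per dimension; which one suits a given task is an empirical choice.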

[31] The claimed technical solution may be implemented, for example, by a system, a machine-readable medium, a server, etc. In this technical solution, a system means, among other things, a computer system, a computer (electronic computing machine), CNC (computer numerical control), a PLC (programmable logic controller), computerized control systems, and any other devices capable of performing a specified, well-defined sequence of operations (actions, instructions).

[32] A command processing device means an electronic unit or an integrated circuit (microprocessor) that executes machine instructions (programs).

[33] The command processing device reads and executes machine instructions (programs) from one or more data storage devices, such as random access memory (RAM) and/or read-only memory (ROM). ROM may include, but is not limited to, hard disk drives (HDD), flash memory, solid-state drives (SSD), optical storage media (CD, DVD, BD, MD, etc.), and others.

[34] A program is a sequence of instructions intended for execution by the control device of a computer or by a command processing device.

[35] The term "instructions" as used in this application may refer generally to software instructions or software commands written in a given programming language to perform a specific function, such as receiving and processing data, forming a user profile, receiving and transmitting signals, analyzing received data, identifying a user, etc. Instructions may be implemented in many ways, including, for example, object-oriented methods. For example, instructions may be implemented using the C++, Java, or Python programming languages, various libraries (for example, Microsoft Foundation Classes), etc. Instructions that carry out the processes described in this solution may be transmitted over both wired and wireless data transmission channels, such as Wi-Fi, Bluetooth, USB, WLAN, LAN, etc.

[36] The presented method for obtaining vector representations of data in a table, taking into account the table structure and contents (Fig. 1 shows a flowchart of the method), speeds up the training of a language model on tabular documents and improves the accuracy of the language model on tasks involving tabular documents by sequentially performing the steps described below.

[37] The proposed method (Fig. 2 shows its general scheme) changes how standard models operate without changing the distribution of input embeddings: it does not add a new kind of positional, rank, or other embedding, and it does not redefine how standard embeddings and tokens are used. It therefore requires no additional pre-training on a large amount of data, and it allows the use of open models in the chosen language, or pre-training for domain adaptation (construction documentation, financial reports, legal documents, etc.), without specific pre-processing or data selection. It also allows the base model to be not only a text-only model but also a model that uses image and layout information, such as UDOP or LayoutLMv3.

[38] Thus, the presented technical solution has the following advantages:

1) no additional pre-training is needed to fit the architecture of the base embedder:

1.1. open models can be used;

1.2. models can be pre-trained for a domain on unstructured text;

1.3. almost any model based on the transformer architecture can serve as the base model, so image information and 2-D coordinates can also be used;

2) the structure of tabular data is exploited;

3) tables with a complex structure can be handled.

[39] In a particular embodiment of the claimed technical solution (an example of operation is shown in Fig. 3), almost any model based on the transformer architecture with a classification layer (for example, ruBert-base: https://huggingface.co/ai-forever/ruBert-base) can serve as the basis for the base token embedder (ENC).

Vector representations for each token in the linearized table are formed by aggregating token embeddings and position embeddings.

[40] A table is defined as a tuple $T = (C, H)$, where $C = \{c_{1,1}, c_{1,2}, \dots, c_{n,m}\}$ is the set of table body cells for $n$ rows and $m$ columns. Each cell $c_{i,j}$ is a sequence of tokens of length $t$. The table header $H = \{h_1, \dots, h_m\}$ is the set of corresponding column header cells, where each $h_j$ is a sequence of header tokens of length $q$.

$T[i,:]$ denotes the $i$-th row (so $H = T[k,:]$, where $k$ is the row in which the table header ends; the table may have a complex structure), and $T[:,j]$ denotes the $j$-th column of $T$.

As an example, suppose each labeled cell $c_{i,j}$ has a sequence of NER tags $e_{i,j} = (e_1, \dots, e_t)$, one tag per token.

IO tags are used, so each $e_k \in \{\text{I-ENT}, \text{O}\}$, where, for example, ENT ∈ {NUM, RES, ORG, TOT}.
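The table notation above can be illustrated with a small sketch. The `Table` class, its field names, and the sample values are hypothetical and are not taken from the patent; as a simplification, the header is stored separately instead of as row $k$ of $T$.

```python
from dataclasses import dataclass
from typing import List

Cell = List[str]  # a cell is a sequence of tokens


@dataclass
class Table:
    """T = (C, H): body cells C (n rows x m columns) and header cells H."""
    body: List[List[Cell]]  # C, indexed as body[i][j]
    header: List[Cell]      # H, one header cell per column

    def row(self, i: int) -> List[Cell]:
        """T[i, :]: all body cells of the i-th row."""
        return self.body[i]

    def column(self, j: int) -> List[Cell]:
        """T[:, j]: all body cells of the j-th column."""
        return [r[j] for r in self.body]


# Illustrative 2 x 2 table with invented content.
table = Table(
    header=[["item"], ["total", "cost"]],
    body=[[["cement"], ["1", "200"]],
          [["sand"], ["300"]]],
)
```

Cells in the same table may have different token lengths, which is why each cell is kept as its own list.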

Positional encoding is performed at the table row level, with the table header tokens $h_j$ included when the token order is computed, which preserves the distribution in the input layer of the base model.

A table attention mask (visibility matrix) $\alpha_{i,j}$ is used, but at the token level rather than the cell level. This mask allows each token to attend only to tokens in the same row or column. $\alpha_{i,j}$ is a symmetric binary matrix defined as:

$$\alpha_{i,j} = \begin{cases} 1, & \text{if } \mathrm{row}(i) = \mathrm{row}(j) \text{ or } \mathrm{col}(i) = \mathrm{col}(j), \\ 0, & \text{otherwise.} \end{cases}$$

Here row (col) are functions that map linearized token indices back to row (column) indices in the table.

Classification is performed at the cell level. Token-level classification remains possible, but it increases the probability of errors and of an entity being split within a cell, and therefore requires a large amount of data to reach a comparable result. For this reason, not only the coordinates of the original words in the token sequence are stored (the BPE segmentation generally depends on the model), but also the coordinates of the table cell boundaries in the token sequence.

The output of the token encoder layer is a sequence of token representations:

Using the stored boundaries of table cells (or words), pooling is applied (https://arxiv.org/abs/2009.07485: max, weighted average, or other kinds of pooling) to obtain a table cell representation. The preferred pooling method may depend on the architecture and pre-training method of the base embedder.

These representations are then passed to a classification layer with Softmax activation to assign a score for each class.
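A minimal sketch of such a classification layer is given below, assuming a single linear layer followed by Softmax. The function names, weights, and the two-class setup are hypothetical; the patent does not specify the layer's dimensions or parameters.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_cell(cell_vec: List[float],
                  weights: List[List[float]],
                  bias: List[float]) -> List[float]:
    """Linear layer followed by softmax; returns one probability per class."""
    logits = [sum(w * x for w, x in zip(row, cell_vec)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical 3-dimensional cell vector and a 2-class layer (e.g. O vs. I-NUM).
cell_vec = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3], [0.0, -0.5, 1.0]]
bias = [0.0, 0.1]
probs = classify_cell(cell_vec, weights, bias)
```

The class with the highest probability would then be taken as the cell's label.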

[41] As a detailed example, a particular embodiment of the method is described below.

Step 1: data is received, including text and table structure.

The input consists of:

1) An image of a table, example below:

2) The results of full-text recognition of the image, including the text, its position, and the table structure, obtained with tools such as SberOCR, Tesseract, ABBYY FineReader, or others.

Below is an example of the resulting full-text recognition (using SberOCR). General information:

Below is an example of the resulting full-text recognition (using SberOCR). A single cell:

Step 2: the table is defined as a set consisting of a list of table header cells and a list of table body cells.

Below is an example of a visual representation of a table as a header and a sequence of numbered cells and columns:

Step 3: each cell of the table body is marked with tags that specify the table identifier, the list of atomic columns to which the cell belongs, and the list of atomic rows to which the cell belongs.

The text is split into sequences. Each sequence carries information about its table (table_id), columns (column_idxs), and table rows (row_idxs). For sequences outside a table, these values are None. An example of the sequences read is shown below:
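As an illustration only (this is not the patent's own example), the tagged sequences could be held in a structure like the following. The field names `table_id`, `column_idxs`, and `row_idxs` come from the text; the `Sequence` class, the `in_table` helper, and the sample values are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sequence:
    text: str
    table_id: Optional[int] = None           # None for text outside a table
    column_idxs: Optional[List[int]] = None  # atomic columns the cell spans
    row_idxs: Optional[List[int]] = None     # atomic rows the cell spans

    @property
    def in_table(self) -> bool:
        return self.table_id is not None


sequences = [
    Sequence("Cost estimate No. 7"),        # a line outside the table
    Sequence("Materials", 0, [0, 1], [0]),  # merged header cell spanning two columns
    Sequence("cement", 0, [0], [2]),
    Sequence("1 200", 0, [1], [2]),
]
```

Storing lists of atomic rows and columns, rather than a single index, is what lets merged cells be represented.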

Step 4: the data of each table body cell is supplemented with information from the corresponding header cells. Header cells are bounded by the index of the last header row, or are specified manually:

The data of each table body cell is supplemented with information from the corresponding header cells:
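A minimal sketch of this step follows, with hypothetical names and data rather than the patent's own example. It assumes that a header cell "corresponds" to a body cell when the two share at least one atomic column, and that header text is simply prefixed to the cell text; the patent does not state how the texts are joined.

```python
from typing import Dict, List

# Hypothetical cells: text plus the atomic rows/columns each one covers.
cells: List[Dict] = [
    {"text": "Materials", "rows": [0], "cols": [0, 1]},
    {"text": "Name", "rows": [1], "cols": [0]},
    {"text": "Cost", "rows": [1], "cols": [1]},
    {"text": "cement", "rows": [2], "cols": [0]},
    {"text": "1 200", "rows": [2], "cols": [1]},
]
HEADER_END_ROW = 1  # assumed: last header row, detected automatically or set by hand


def is_header(cell: Dict) -> bool:
    return max(cell["rows"]) <= HEADER_END_ROW


def add_header_context(cell: Dict, all_cells: List[Dict]) -> str:
    """Prefix a body cell with the text of every header cell sharing a column."""
    headers = [h["text"] for h in all_cells
               if is_header(h) and set(h["cols"]) & set(cell["cols"])]
    return " ".join(headers + [cell["text"]])


augmented = [add_header_context(c, cells) for c in cells if not is_header(c)]
```

With a multi-level header, each body cell picks up every header level above it, so "cement" becomes "Materials Name cement".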

Step 5: the text in the table is tokenized.

The text is split into parts that are meaningful to the model (tokens) and supplemented with the service tokens the model needs. Below is an example of the tokens obtained from the source text:
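The following sketch illustrates the idea with a toy tokenizer; it is not the patent's example. A real system would use the subword tokenizer of the chosen base model. The BERT-style service tokens `[CLS]` and `[SEP]`, and the choice to place `[SEP]` after every sequence, are assumptions.

```python
import re
from typing import List

CLS, SEP = "[CLS]", "[SEP]"  # BERT-style service tokens (assumed)


def toy_tokenize(text: str) -> List[str]:
    """Stand-in for a subword tokenizer: lowercases, splits words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode_sequences(texts: List[str]) -> List[str]:
    """Tokenize every sequence and wrap the result in service tokens."""
    tokens = [CLS]
    for text in texts:
        tokens.extend(toy_tokenize(text))
        tokens.append(SEP)
    return tokens


tokens = encode_sequences(["Name cement", "Cost 1,200"])
```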

Step 6: positional encoding is performed at the table row level. The table header tokens are included when the token order is computed, which preserves the distribution in the input layer of the base model.
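The patent gives no formula for this step. One possible reading, sketched below under that assumption, is that position indices restart for every table row and run through all tokens of the row, including the header tokens added to each cell. The function name and data layout are hypothetical.

```python
from typing import List


def row_level_position_ids(rows: List[List[List[str]]]) -> List[List[int]]:
    """rows[r][c] is the token list of one cell (header tokens included).

    Positions restart at 0 for every row and run through all of its cells.
    """
    all_ids = []
    for row in rows:
        n_tokens = sum(len(cell) for cell in row)
        all_ids.append(list(range(n_tokens)))
    return all_ids


rows = [
    [["name", "cement"], ["cost", "1", "200"]],  # row 0: header + body tokens per cell
    [["name", "sand"], ["cost", "300"]],         # row 1
]
position_ids = row_level_position_ids(rows)
```

Under this reading the position values stay in the short range the base model saw during pre-training, even for long tables.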

Step 7: vector representations are formed for each token in the table by aggregating token vector representations and positional vector representations. Below is an example of the vector representations obtained from the first part of the transformer model, the encoder:

Step 8: an attention matrix is formed using each cell's membership in a table column or row. The attention matrix is passed to the transformer model. It consists of 0s and 1s: a value of 1 means that two tokens are conditioned on each other, i.e. dependent; a value of 0 means that the tokens are independent for the model and that no computation is required for that pair. Every token, inside or outside the table, is conditioned on all tokens outside the table. Inside the table, tokens are conditioned only on tokens in the same row or column. Fig. 4 shows the attention matrix, where white is 1 and black is 0.
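The rules of Step 8 can be sketched as follows. The per-token tag layout and function names are hypothetical. Because the matrix is described as symmetric, the sketch also assumes that tokens outside the table can see table tokens, and that cells spanning several atomic rows or columns are visible to any cell sharing at least one of them.

```python
from typing import List, Optional, Set, Tuple

# Per-token tag: None for text outside the table, otherwise
# (atomic rows, atomic columns) of the cell the token belongs to.
TokenTag = Optional[Tuple[Set[int], Set[int]]]


def visible(a: TokenTag, b: TokenTag) -> int:
    """1 if the two tokens may attend to each other, else 0."""
    if a is None or b is None:  # text outside the table is visible to everything
        return 1
    rows_a, cols_a = a
    rows_b, cols_b = b
    return int(bool(rows_a & rows_b) or bool(cols_a & cols_b))


def build_attention_matrix(tags: List[TokenTag]) -> List[List[int]]:
    return [[visible(a, b) for b in tags] for a in tags]


tags: List[TokenTag] = [
    None,        # a token of the text before the table
    ({0}, {0}),  # cell (row 0, col 0)
    ({0}, {1}),  # cell (row 0, col 1)
    ({1}, {0}),  # cell (row 1, col 0)
    ({1}, {1}),  # cell (row 1, col 1)
]
mask = build_attention_matrix(tags)
```

In this 2 x 2 example only the diagonally opposite cells are hidden from each other, since they share neither a row nor a column.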

Step 9: the coordinates of the table cell boundaries in the table token sequence are stored.

The boundaries of the sequences (lines outside tables and cells inside a table) are stored as a list of pairs: the first value is the index of the first token of the sequence (cell or line) and the second value is the index of its last token. After the token vector representations have been obtained, these boundaries are used to obtain the vector representations of the cells.

Below is an example of a list of sequence boundaries, expressed in token indices:
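As an illustration with invented data (not the patent's own listing), such a boundary list can be computed while the token sequences are concatenated. The function name and the `offset` parameter are hypothetical; the sketch assumes non-empty sequences with no separator tokens between them.

```python
from typing import List, Tuple


def sequence_boundaries(token_lists: List[List[str]],
                        offset: int = 0) -> List[Tuple[int, int]]:
    """Return inclusive (first_token, last_token) indices for every sequence.

    `offset` is the index of the first text token, e.g. 1 if one
    service token precedes the text.
    """
    bounds = []
    start = offset
    for toks in token_lists:
        end = start + len(toks) - 1
        bounds.append((start, end))
        start = end + 1
    return bounds


token_lists = [["cost", "estimate"], ["name", "cement"], ["cost", "1", "200"]]
bounds = sequence_boundaries(token_lists, offset=1)
```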

Step 10: the base model receives the prepared text and positional vector representations of the tokens together with the attention matrix, and processes them to obtain contextualized vector representations of the tokens.

The transformer model takes as input the token indices, the attention matrix, and any other inputs required by the specific model. Its output is a list of vector representations, one per token, each enriched with the contextual information relevant to that token. The number of output vectors equals the number of tokens.

The first stage of the model obtains non-contextualized vector representations of words from the vocabulary. A positional vector representation is usually added to them.

These vector representations are then fed to the main part of the transformer model, which contextualizes the word vector representations through the self-attention mechanism.
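To show how an attention matrix of 0s and 1s restricts self-attention, here is a deliberately simplified sketch: a single attention head with no learned projections (queries, keys, and values are all the input vectors). This is not the BertModel implementation; all names and values are illustrative.

```python
import math
from typing import List

Vector = List[float]


def masked_self_attention(x: List[Vector], mask: List[List[int]]) -> List[Vector]:
    """Single-head attention with Q = K = V = x (no learned projections).

    Positions where mask[i][j] == 0 receive zero attention weight.
    Every row of the mask must allow at least one position.
    """
    dim = len(x[0])
    out = []
    for i, q in enumerate(x):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in x]
        allowed = [j for j in range(len(x)) if mask[i][j]]
        top = max(scores[j] for j in allowed)  # for numerical stability
        weights = [math.exp(scores[j] - top) if mask[i][j] else 0.0
                   for j in range(len(x))]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, x)) / total
                    for d in range(dim)])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 1]]
contextual = masked_self_attention(x, mask)
```

The third token can see only itself, so its output equals its input, while the first two tokens are mixed with each other. The output has one vector per input token.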

The BertModel class from the transformers library was used as the base model:

https://github.com/huggingface/transformers/blob/main/transformers/models/bert/modeling_bert.py#L956.

Below is an example of the list of token vector representations at the output of the transformer model:

Step 11: using the stored coordinates of the table cell boundaries, pooling is applied to obtain a vector representation of each table cell. Pooling is performed over the tokens within each stored boundary. After pooling, the number of vectors equals the number of table cells plus the number of sequences (lines) outside the table. Below is an example of the list of vector representations of the sequences (cells and lines):
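A minimal sketch of boundary-based pooling, with invented vectors rather than the patent's own listing. Mean and max pooling are shown; the function name, the `mode` parameter, and the inclusive `(first, last)` boundary format are assumptions consistent with Step 9.

```python
from typing import List, Tuple

Vector = List[float]


def pool(token_vecs: List[Vector],
         bounds: List[Tuple[int, int]],
         mode: str = "mean") -> List[Vector]:
    """Pool token vectors inside each inclusive (first, last) boundary."""
    pooled = []
    for first, last in bounds:
        span = token_vecs[first:last + 1]
        columns = list(zip(*span))  # regroup values by vector dimension
        if mode == "max":
            pooled.append([max(col) for col in columns])
        else:
            pooled.append([sum(col) / len(col) for col in columns])
    return pooled


token_vecs = [[0.0, 0.0], [1.0, 3.0], [3.0, 1.0], [2.0, 2.0]]
bounds = [(1, 2), (3, 3)]  # two sequences; token 0 is a service token
cell_vecs = pool(token_vecs, bounds)
```

The result contains one vector per stored boundary, regardless of how many tokens each sequence has.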

[42] In general (see Fig. 5), the system (500) for obtaining vector representations of data in a table, taking into account the table structure and contents, comprises one or more processors (501), memory such as RAM (502) and ROM (503), and input/output interfaces (504), all connected by a common data bus.

[43] The processor (501) (or several processors, a multi-core processor, etc.) can be selected from the range of devices widely used today, for example from manufacturers such as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc. The processor, or one of the processors used in the system (500), should also be understood to include a graphics processor, for example an NVIDIA GPU with a CUDA-compatible programming model, or Graphcore. Such devices are also suitable for fully or partially performing the method and can be used to train and apply machine learning models in various information systems.

[44] The RAM (502) is random access memory intended for storing machine-readable instructions executed by the processor (501) to perform the necessary logical data processing operations. The RAM (502) typically holds the executable instructions of the operating system and the relevant software components (applications, software modules, etc.). The available memory of a graphics card or graphics processor may also serve as the RAM (502).

[45] The ROM (503) is one or more permanent data storage devices, such as a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[46] Various types of I/O interfaces (504) are used to organize the operation of the components of the device (500) and of externally connected devices. The choice of interfaces depends on the specific design of the computing device and may include, without limitation: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, Type-C), TRS/audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

[47] Various I/O means (505) are used for user interaction with the device (500), for example a keyboard, display (monitor), touch display, touchpad, joystick, mouse, light pen, stylus, touch panel, trackball, speakers, microphone, augmented reality devices, optical sensors, tablet, indicator lights, projector, camera, biometric identification devices (retina scanner, fingerprint scanner, voice recognition module), etc.

[48] The networking means (506) provides data transmission over an internal or external computer network, such as an intranet, the Internet, a LAN, etc. One or more such means (506) may be, without limitation: an Ethernet card, a GSM modem, a GPRS modem, an LTE modem, a 5G modem, a satellite communication module, an NFC module, a Bluetooth and/or BLE module, a Wi-Fi module, etc.

[49] The specific choice of elements of the device (500) for implementing various hardware and software architectures may vary, provided the required functionality is preserved. In particular, such an implementation may be built from the electronic components used to create digital integrated circuits. Without limitation, chips whose logic is fixed at manufacture may be used, or field-programmable gate arrays (FPGAs), whose logic is set by programming. Programming is done with programmers and debugging environments that allow the desired structure of a digital device to be specified as a circuit diagram or as a program in a hardware description language: Verilog, VHDL, AHDL, etc. Alternatives to FPGAs include programmable logic controllers (PLCs); gate arrays (base matrix crystals), which require a factory production process for programming; and ASICs, application-specific custom large-scale integrated circuits (LSI), which are significantly more expensive in small-batch and one-off production. Thus, the implementation can be achieved by standard means based on the classical principles of computing technology.

[50] The materials presented in this application disclose preferred embodiments of the technical solution and should not be interpreted as limiting other, particular embodiments that remain within the scope of the requested legal protection and are obvious to those skilled in the relevant art.

Claims

1. A method for obtaining vector representations of data in a table taking into account the table structure and its contents, implemented using a processor and a data storage device, including the following steps:

- receive data including: text, table structure;

- a table is defined as a set of a list of table header cells and a list of table body cells;

- each cell of the table body is marked with tags that characterize: the table identifier, the list of atomic columns to which the cell belongs, the list of atomic rows to which the cell belongs;

- the data of each table body cell is supplemented with information from the corresponding header cells;

- tokenize the text in the table;

- perform positional coding at the table row level;

- form vector representations of tokens for each token in the table by aggregating vector representations of tokens and positional vector representations;

- form an attention matrix using the cell’s belonging to a column or row of the table;

- save the coordinates of the table cell boundaries in the sequence of table tokens;

- the base model receives prepared text and positional vector representations of tokens and an attention matrix as input and processes them, obtaining contextualized vector representations of tokens;

- using the stored coordinates of the table cell boundaries, use pooling to obtain a vector representation of the table cell.

2. The method according to claim 1, characterized in that the base model is further trained to perform a specific task.

3. The method according to claim 1, characterized in that the base model is pre-trained on unstructured text for the domain, without the need for information about tables.

4. A system for obtaining vector representations of data in a table, taking into account the structure of the table and its contents, used for machine learning, containing:

at least one data processing device;

at least one data storage device;

at least one program, where one or more programs are stored on one or more data storage devices and executed on one or more data processing devices, wherein the one or more programs ensure the execution of the following steps:

- receive data including: text, table structure;

- tokenize the text in the table;

- perform positional coding at the table row level;

5. The system according to claim 4, characterized in that the base model is further trained to perform a specific task.

6. The system according to claim 4, characterized in that the base model is pre-trained on unstructured text for the domain, without the need for information about tables.