CN116401566A - Grouping method, grouping device, computer equipment and storage medium for data objects - Google Patents
- Publication number
- CN116401566A (application number CN202310403129.6A)
- Authority
- CN
- China
- Prior art keywords
- data object
- group
- data
- object group
- groups
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2433—Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/30—Post-processing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of data processing, and in particular to a method and device for grouping data objects, a computer device, and a storage medium.
Background
A controlled experiment is a research method that sets up an "experimental group" and a "control group" whose attributes are identical or equivalent except for the experimental treatment factor, in order to show whether that factor produces a given result (a change or difference in the target variable). It is widely used in medicine, biology, agronomy, engineering science, Internet technology, and the social and economic sciences. The key to a controlled experiment is ensuring that the experimental group and the control group are comparable. Groups are usually formed by random sampling, so that the samples in the two groups are mutually independent and individual differences are small.
In many fields, however, researchers can exert only limited control over the subjects of an experiment, and the sample data are unavoidably spatially correlated: data from individuals that are geographically close influence one another, while data from individuals that are far apart may differ substantially in their characteristics. For example, in a real estate pricing study that compares the effect of different pricing schemes on final revenue, the prices of different homes within the same development influence one another, while homes in different developments may not be comparable in price because of large differences in location, transportation, environment, supporting facilities, floor plan, and so on. This spatial correlation and individual variability make grouping data for controlled experiments very difficult.
There is therefore an urgent need for a scheme for grouping controlled-experiment data that eliminates the mutual influence between data from subjects that take part in the experiment and data from subjects that do not, so that a "control group" and an "experimental group" with large individual differences are comparable.
Summary of the Invention
In view of this, the present invention provides a method and device for grouping data objects, a computer device, and a storage medium, to solve the problem that in many fields it is difficult to exert much control over the subjects of an experiment, so that the sample data unavoidably exhibit spatial correlation and individual variability, which makes grouping the data objects of a controlled experiment very difficult.
In a first aspect, the present invention provides a method for grouping data objects. The method includes: dividing candidate data objects into multiple data object groups by spatial clustering based on geographic coordinate data; removing data objects from any two of the data object groups based on the inter-group distance, to obtain multiple data object subgroups that are spatially uncorrelated with one another; randomly sampling the data object subgroups to obtain a control data object group and an experimental data object group; subjecting the control data object group and the experimental data object group to a controlled experiment and, after the experiment, calculating the target variable difference between the two groups at the current observation point; and, when the target variable difference is greater than or equal to a difference threshold, swapping some of the data objects between the control data object group and the experimental data object group. This process eliminates the mutual influence between data from subjects that take part in the experiment and data from subjects that do not, so that a "control group" and an "experimental group" with large individual differences are comparable.
In a second aspect, the present invention provides a device for grouping data objects. The grouping device mainly includes a data object division module, a data object removal module, a data object sampling module, a data object experiment module, and a data object swap module. The data object division module divides candidate data objects into multiple data object groups by spatial clustering based on geographic coordinate data. The data object removal module removes data objects from any two of the data object groups based on the inter-group distance, to obtain multiple data object subgroups that are spatially uncorrelated with one another. The data object sampling module randomly samples the data object subgroups to obtain a control data object group and an experimental data object group. The data object experiment module subjects the control data object group and the experimental data object group to a controlled experiment and, after the experiment, calculates the target variable difference between the two groups at the current observation point. The data object swap module swaps some of the data objects between the control data object group and the experimental data object group when the target variable difference is greater than or equal to the difference threshold. This process eliminates the mutual influence between data from subjects that take part in the experiment and data from subjects that do not, so that a "control group" and an "experimental group" with large individual differences are comparable.
In a third aspect, the present invention provides a computer device including a memory and a processor that are communicatively connected to each other. The memory stores computer instructions, and the processor executes the computer instructions to perform the method for grouping data objects of the first aspect or any of its corresponding implementations.
In a fourth aspect, the present invention provides a computer-readable storage medium storing computer instructions that cause a computer to perform the method for grouping data objects of the first aspect or any of its corresponding implementations.
Brief Description of the Drawings
To explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for grouping data objects according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of another method for grouping data objects according to an embodiment of the present invention;
FIG. 4 is a structural block diagram of a device for grouping data objects according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
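Referring to FIG. 1, FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application. The diagram includes a client 10 and a server 20. After receiving the geographic coordinate data and the candidate data objects uploaded by the client 10, the server 20 can group the candidate data objects based on the geographic coordinate data, ultimately yielding a controlled-experiment data grouping scheme that removes spatial correlation and makes a "control group" and an "experimental group" with large individual differences comparable.
Specifically, in the embodiments of the present application, the client 10 shown in FIG. 1 that transmits the geographic coordinate data and the candidate data objects may be a physical device such as a user's smartphone, desktop computer, tablet computer, laptop computer, digital assistant, or smart wearable device. Smart wearable devices may include smart bracelets, smart watches, smart glasses, smart helmets, and the like. The client 10 is not limited to such physical electronic devices; it may also be software running on them. For example, the client 10 may be a web page or an application that a service provider offers to users.
Optionally, the client 10 may include a display screen, a storage device, and a processor connected by a data bus. The display screen is used to display the geographic coordinate data and the candidate data objects, and may be the touch screen of a mobile phone or tablet computer, for example. The storage device is used to store the geographic coordinate data, the candidate data objects, and other data; it may be the internal memory of the client 10, or a storage device such as a smart media card, a secure digital card, or a flash card. The processor may be a single-core or multi-core processor.
In the embodiments of the present application, the geographic coordinate data and the candidate data objects may be received by the server 20 shown in FIG. 1, by another computer terminal with the same functions as the server, or by a similar computing device. Further, the server 20 may be replaced by a server system, a computing platform, or a server cluster containing multiple servers.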
According to an embodiment of the present invention, an embodiment of a method for grouping data objects is provided. It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although the flowcharts show a logical order, in some cases the steps shown or described may be performed in a different order.
This embodiment provides a method for grouping data objects that can be used on the client described above, such as a smartphone or tablet computer. FIG. 2 is a flowchart of the method for grouping data objects according to an embodiment of the present invention. As shown in FIG. 2, the flow includes the following steps:
Step S201: based on geographic coordinate data, divide the candidate data objects into multiple data object groups by spatial clustering.
In this embodiment, the candidate data objects are first obtained according to the experimental target and are then divided into multiple data object groups by spatial clustering based on the geographic coordinate data, which provides the data basis for the subsequent division into data object subgroups. The candidate data objects include the spatial position of each object and the corresponding object information. That is, the candidate data objects are data objects that have a spatial position and can reflect the attributes of the experimental target.
In an optional implementation, when there are m candidate data objects, 2n of them are selected at random as the initial cluster centers, which fixes the initial coordinates of the cluster centers. The value of n is determined by the specific problem under study and by the number m of candidate data objects. For example, 2×200 = 400 data objects can be drawn at random from m = 10000 candidate data objects as the initial cluster centers, where 200 is the number of circles to which the 10000 candidate data objects belong. The distance between each remaining candidate data object and each initial cluster center is then calculated, and each candidate data object is assigned to the class whose center is nearest, which divides the candidate data objects into multiple data object groups. The distance between a candidate data object and an initial cluster center can be calculated from the geographic coordinates of the candidate data object, the geographic coordinates of the initial cluster center, and the radius of the earth.
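As a non-limiting illustration, the random selection of initial cluster centers and the nearest-center assignment described above can be sketched as follows. The function name `assign_to_nearest`, the `seed` parameter, and the `dist` callback are hypothetical names introduced for this sketch; the distance function is passed in so that any spherical distance can be supplied.

```python
import random

def assign_to_nearest(objects, n_centers, dist, seed=0):
    """Pick n_centers objects at random as initial centers and assign
    every remaining object to its nearest center.

    objects   : list of (lng, lat) tuples
    n_centers : number of initial cluster centers (2n in the text)
    dist      : callable(p, q) -> distance between two objects (assumed)
    Returns (centers, groups), where groups[k] holds the objects of class k.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    center_idx = rng.sample(range(len(objects)), n_centers)
    centers = [objects[i] for i in center_idx]
    groups = [[c] for c in centers]  # each center seeds its own class
    chosen = set(center_idx)
    for i, obj in enumerate(objects):
        if i in chosen:
            continue
        # index of the center with the smallest distance to this object
        nearest = min(range(n_centers), key=lambda k: dist(obj, centers[k]))
        groups[nearest].append(obj)
    return centers, groups
```

With m = 10000 objects and `n_centers=400`, this corresponds to the example in the text; the result is only the initial division, before any update of the cluster centers.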
Step S202: based on the inter-group distance, remove data objects from any two of the data object groups to obtain multiple data object subgroups that are spatially uncorrelated with one another.
In this embodiment, once the data object groups have been obtained, data objects are removed from any two of the groups based on the inter-group distance so that no spatial correlation remains between groups, which yields multiple data object subgroups that are spatially uncorrelated with one another. When data objects are removed, they are removed preferentially from the group that contains more data objects.
In an optional implementation, the minimum distance between each pair of the resulting data object groups is calculated, and the data objects that cause the inter-group distance to fall below a preset distance threshold are removed. The distance threshold determines whether two groups are treated as spatially correlated: groups closer than the threshold are spatially correlated and groups at or beyond it are not, and its value must be set according to the problem under study. The removal process is repeated over every pair of data object groups until the minimum distance between any two groups is no less than the distance threshold, which yields multiple data object subgroups that are spatially uncorrelated with one another.
For example, suppose the preset distance threshold is 1500 m. If, among the 400 resulting data object groups, groups a and b contain 120 and 150 data objects respectively, and the closest pair of objects across the two groups is data object 1 in group a and data object 2 in group b (say 1200 m apart), then data object 2 must be deleted from group b, because group b contains more data objects. The distances between the data objects of groups a and b are recalculated and objects are deleted in this way until the closest distance between the two groups is no less than 1500 m.
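As a non-limiting illustration, the removal loop for one pair of groups can be sketched as follows. The function name `prune_pair` and the `dist` callback are hypothetical. The text does not say which group loses an object when both groups have the same size; removing from the second group in that case is an assumption of this sketch.

```python
def prune_pair(group_a, group_b, threshold, dist):
    """Delete objects until the closest cross-group pair is at least
    `threshold` apart. Returns the two pruned groups as new lists.

    dist : callable(p, q) -> distance between two objects (assumed)
    """
    a, b = list(group_a), list(group_b)
    while a and b:
        # closest pair across the two groups, with its indices
        d, i, j = min((dist(p, q), i, j)
                      for i, p in enumerate(a)
                      for j, q in enumerate(b))
        if d >= threshold:
            break  # the groups are now treated as spatially uncorrelated
        # remove the offending object from the larger group
        # (ties go to group b, which is an assumption)
        if len(b) >= len(a):
            b.pop(j)
        else:
            a.pop(i)
    return a, b
```

Applying `prune_pair` to every pair of groups, and repeating until no pair changes, gives the subgroups described in step S202. The exhaustive pairwise search is quadratic in the group sizes, so a spatial index may be preferable for large groups.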
Step S203: randomly sample the data object subgroups to obtain a control data object group and an experimental data object group.
In this embodiment, the data object subgroups obtained in the previous step are no longer spatially correlated with one another. To obtain the control data object group and the experimental data object group, the subgroups are divided between the two by random sampling.
Step S204: subject the control data object group and the experimental data object group to a controlled experiment and, after the experiment, calculate the target variable difference between the two groups at the current observation point.
In this embodiment, after the data object subgroups have been randomly sampled into the control data object group and the experimental data object group, the target variables of the two groups are very likely to differ because of individual variability. The two groups are therefore subjected to a controlled experiment, and after the experiment the target variable difference between them at the current observation point is calculated. The target variable is the value that the experimental target takes after the experiment, that is, at the next observation point.
In an optional implementation, the 2n data object subgroups are randomly divided into two groups of n subgroups each (for example, two groups each containing n = 200 data object subgroups), which serve as the control data object group and the experimental data object group. Because of individual variability, the target variables of the two groups are very likely to differ. Without loss of generality, assume that at the observation point after the experimental treatment, the quantity used to judge the difference between the two groups is the sum of the individual target variables within each group (a statistic such as the mean or median could be used instead). The sum of the individual target variables in the control data object group at the current observation point is compared with the corresponding sum for the experimental data object group, and subtracting one from the other gives the target variable difference between the two groups at the current observation point.
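As a non-limiting illustration, the random division of the subgroups and the difference of sums can be sketched as follows. The function names and the representation of a subgroup as a list of individual target values are assumptions made for this sketch.

```python
import random

def split_subgroups(subgroups, seed=0):
    """Randomly divide 2n subgroups into two halves of n subgroups each.
    Returns (control, experimental)."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    shuffled = list(subgroups)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def target_difference(control, experimental):
    """Sum of individual target values in the control group minus the
    same sum for the experimental group (signed difference)."""
    def total(groups):
        return sum(value for group in groups for value in group)
    return total(control) - total(experimental)
```

Replacing `total` with a mean or median gives the alternative statistics mentioned in the text.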
Step S205: when the target variable difference is greater than or equal to the difference threshold, swap some of the data objects between the control data object group and the experimental data object group.
In this embodiment, to reduce the difference between the control data object group and the experimental data object group in the controlled experiment, the target variable difference is compared with the difference threshold after the experiment, and when it is greater than or equal to the threshold, some of the data objects in the two groups are swapped. The swap may exchange a single object between the control data object group and the experimental data object group, or it may exchange several objects.
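As a non-limiting illustration, a single-object swap triggered by the threshold test can be sketched as follows. The text does not specify which objects are exchanged, so the indices `i` and `j` are hypothetical parameters, and comparing the magnitude of the difference (rather than its signed value) against the threshold is an assumption of this sketch.

```python
def swap_if_needed(control, experimental, diff, threshold, i=0, j=0):
    """Exchange control[i] and experimental[j] when |diff| >= threshold.
    Returns the (possibly updated) groups; the inputs are not modified."""
    if abs(diff) < threshold:
        return control, experimental  # groups are already comparable
    control, experimental = list(control), list(experimental)
    control[i], experimental[j] = experimental[j], control[i]
    return control, experimental
```

A multi-object exchange would apply the same swap to several index pairs.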
The method for grouping data objects provided in this embodiment divides the candidate data objects into multiple data object groups by spatial clustering based on geographic coordinate data, which provides the data basis for the subsequent division into data object subgroups. It removes data objects from any two of the data object groups based on the inter-group distance, to obtain multiple data object subgroups that are spatially uncorrelated with one another. It randomly samples the data object subgroups to obtain a control data object group and an experimental data object group. It subjects the two groups to a controlled experiment, calculates the target variable difference between them at the current observation point after the experiment, and, when that difference is greater than or equal to the difference threshold, swaps some of the data objects between the two groups to reduce their difference in the controlled experiment. This eliminates the mutual influence between data from subjects that take part in the experiment and data from subjects that do not, so that a "control group" and an "experimental group" with large individual differences are comparable.
This embodiment provides a method for grouping data objects that can be used on the mobile terminal described above, such as a smartphone or tablet computer. FIG. 3 is a flowchart of the method for grouping data objects according to an embodiment of the present invention. As shown in FIG. 3, the flow includes the following steps:
Step S301: based on geographic coordinate data, divide the candidate data objects into multiple data object groups by spatial clustering.
Specifically, step S301 includes:
Step S3011: randomly select multiple objects from the candidate data objects as the initial cluster centers.
In an optional implementation, when there are m candidate data objects, 2n of them are selected at random as the initial cluster centers, which fixes the initial coordinates of the cluster centers. The value of n is determined by the specific problem under study and by the number m of candidate data objects. For example, 2×200 = 400 data objects can be drawn at random from m = 10000 candidate data objects as the initial cluster centers, where 200 is the number of circles to which the 10000 candidate data objects belong.
Step S3012: calculate the distance from each remaining candidate data object to each initial cluster center, and assign the candidate data object to the class of the corresponding initial cluster center.
In an optional implementation, the longitude and latitude of each candidate data object, the longitude and latitude of each initial cluster center, and the radius of the earth are obtained first. A clustering distance calculation model is then created from a spherical distance function together with these longitudes, latitudes, and the earth radius. Finally, the model is used to calculate the distance from each remaining candidate data object to each initial cluster center.
The clustering distance calculation model is expressed as:
$$d = \mathrm{distCosin}(lng_i,\ lat_i,\ lng_o,\ lat_o,\ r)$$
where distCosin is the spherical distance function, $lng_i$ is the longitude of candidate data object i, $lat_i$ is the latitude of candidate data object i, $lng_o$ is the longitude of the cluster center of the o-th class, $lat_o$ is the latitude of the cluster center of the o-th class, and r is the radius of the earth, 6378137 m by default.
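As a non-limiting illustration, a spherical distance function with the same argument order can be sketched with the spherical law of cosines. The text does not give the body of distCosin, so treating it as the law-of-cosines great-circle distance is an assumption suggested by its name; `dist_cosine` and `EARTH_RADIUS_M` are hypothetical names.

```python
import math

EARTH_RADIUS_M = 6378137.0  # default radius stated in the text, in meters

def dist_cosine(lng1, lat1, lng2, lat2, r=EARTH_RADIUS_M):
    """Great-circle distance in meters between two points given in
    decimal degrees, using the spherical law of cosines (assumed form)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lng2 - lng1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlam))
    # clamp against rounding error so acos stays in its domain
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return r * math.acos(cos_angle)
```

The law-of-cosines form loses precision for points that are very close together; for distances of a few meters a haversine formulation is often preferred.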
步骤S3013,基于划分结果更新每个类的聚类中心,形成多个数据对象组。Step S3013, update the cluster center of each class based on the division result to form multiple data object groups.
一种可选的实施方式中,获取当前类中各备选数据对象的经度数据,并将各备选数据对象的经度数据求和后除以当前类中包含的备选数据对象数量,得到当前类的聚类中心坐标经度;获取当前类中各备选数据对象的纬度数据,并将各备选数据对象的纬度数据求和后除以当前类中包含的备选数据对象数量,得到当前类的聚类中心坐标纬度;根据当前类的聚类中心坐标经度和聚类中心坐标纬度,更新当前类的聚类中心。重复步骤S3012至步骤S3013,直至每个类的聚类中心都不再发生改变时结束,得到多个数据对象组。In an optional implementation manner, the longitude data of each candidate data object in the current class is obtained, and the longitude data of each candidate data object are summed and divided by the number of candidate data objects contained in the current class to obtain the current The longitude of the cluster center coordinates of the class; obtain the latitude data of each candidate data object in the current class, and divide the latitude data of each candidate data object by the number of candidate data objects contained in the current class to obtain the current class The latitude and longitude of the cluster center coordinates of the current class; update the cluster center of the current class according to the longitude and latitude of the cluster center coordinates of the current class. Step S3012 to step S3013 are repeated until the cluster center of each class does not change, and a plurality of data object groups are obtained.
The cluster center of the current class is expressed as:
$$lng_o = \frac{1}{N_o}\sum_{i \in o} lng_i, \qquad lat_o = \frac{1}{N_o}\sum_{i \in o} lat_i$$
where $lng_o$ is the longitude of the cluster center of the $o$-th class; $lat_o$ is the latitude of the cluster center of the $o$-th class; $N_o$ is the number of data objects in the $o$-th class; $lng_i$ is the longitude of data object $i$ belonging to the $o$-th class; and $lat_i$ is the latitude of data object $i$ belonging to the $o$-th class.
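The clustering loop described here (random initial centers, assignment to the nearest center, mean update, repeat until the centers stop changing) can be sketched as follows. This is a hedged illustration, not the patent's implementation: the function names, the `(lng, lat)` tuple layout, the `seed` parameter and the `max_iter` safety cap are all assumptions, and the distance is taken to be the spherical law of cosines.

```python
import math
import random

EARTH_RADIUS_M = 6378137.0


def dist_cosine(lng1, lat1, lng2, lat2, r=EARTH_RADIUS_M):
    """Spherical law-of-cosines distance in meters (assumed form of distCosin)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    c = (math.sin(p1) * math.sin(p2)
         + math.cos(p1) * math.cos(p2) * math.cos(math.radians(lng2 - lng1)))
    return r * math.acos(max(-1.0, min(1.0, c)))


def mean_center(members):
    """Cluster center: arithmetic mean of longitudes and of latitudes."""
    n = len(members)
    return (sum(p[0] for p in members) / n, sum(p[1] for p in members) / n)


def cluster(points, k, seed=0, max_iter=100):
    """Group (lng, lat) tuples into k classes; returns (centers, groups)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random data objects as initial centers
    groups = []
    for _ in range(max_iter):  # max_iter is a safety cap, not from the source
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: dist_cosine(p[0], p[1], centers[c][0], centers[c][1]),
            )
            groups[nearest].append(p)
        # An empty class keeps its previous center (an assumption).
        new_centers = [mean_center(g) if g else centers[o]
                       for o, g in enumerate(groups)]
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, groups
```

Averaging raw longitudes is only sensible when a class does not straddle the 180° meridian, which holds for city-scale data such as the rental example in the text.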
Step S302: based on the inter-group distance, remove data objects from any two of the multiple data object groups to obtain multiple data object subgroups that are spatially uncorrelated with one another.
Specifically, first obtain the longitude and latitude of each data object in the first compared data object group, the longitude and latitude of each data object in the second compared data object group, and the Earth's radius. Then create an inter-group distance calculation model from the spherical distance function, these coordinates, and the Earth's radius. Use the model to calculate the distance between the two closest data objects across the first and second compared data object groups. When this distance is less than the distance threshold, delete one of those two closest data objects. Repeating this yields multiple data object subgroups that are spatially uncorrelated with one another.
In an optional implementation, the minimum distance is calculated for every pair of the resulting data object groups, and the data objects that cause an inter-group distance below the preset distance threshold $D$ are removed. The threshold $D$ determines spatial correlation: two groups closer than $D$ are treated as spatially correlated, and otherwise as uncorrelated; its value must be set according to the problem under study. For example, for two data object groups $o_1$ and $o_2$, the minimum distance is the distance between the closest pair of data objects across the two groups:
$$d_{o_1,o_2} = \min_{i \in o_1,\; j \in o_2} \mathrm{distCosin}(lng_i, lat_i, lng_j, lat_j, r)$$
where distCosin is the spherical distance function, $lng_i$ and $lat_i$ are the longitude and latitude of data object $i$ in group $o_1$, $lng_j$ and $lat_j$ are the longitude and latitude of data object $j$ in group $o_2$, and $r$ is the Earth's radius, 6378137 meters by default.
If this minimum distance is less than the preset distance threshold $D$, one of the two closest data objects between groups $o_1$ and $o_2$ is deleted, preferring the one whose group contains more data objects. That is, if
$$d_{o_1,o_2} < D$$
and
$$N_{o_1} < N_{o_2}$$
then $j$ is removed, where $N_{o_1}$ and $N_{o_2}$ are the numbers of data objects in groups $o_1$ and $o_2$ respectively.
This removal process is applied to every pair of data object groups until the minimum distance between any two groups is no less than $D$, which yields $2n$ data object subgroups that are spatially uncorrelated with one another.
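The pairwise removal process can be sketched as below. It is a hedged illustration under several assumptions: groups are lists of `(lng, lat)` tuples, the distance is the spherical law of cosines, and when two groups have equal size the object in the second group is removed (the text does not say how ties are broken). All function names are hypothetical.

```python
import itertools
import math

EARTH_RADIUS_M = 6378137.0


def dist_cosine(lng1, lat1, lng2, lat2, r=EARTH_RADIUS_M):
    """Spherical law-of-cosines distance in meters (assumed form of distCosin)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    c = (math.sin(p1) * math.sin(p2)
         + math.cos(p1) * math.cos(p2) * math.cos(math.radians(lng2 - lng1)))
    return r * math.acos(max(-1.0, min(1.0, c)))


def closest_pair(group_a, group_b):
    """Return (distance, i, j) for the closest objects across two groups."""
    return min(
        ((dist_cosine(a[0], a[1], b[0], b[1]), i, j)
         for i, a in enumerate(group_a)
         for j, b in enumerate(group_b)),
        key=lambda item: item[0],
    )


def prune_groups(groups, threshold_m):
    """Delete objects until every pair of groups is at least threshold_m apart."""
    groups = [list(g) for g in groups]  # work on copies
    # One pass over all pairs is enough: deleting an object can only
    # increase the minimum distance between any two groups.
    for x, y in itertools.combinations(range(len(groups)), 2):
        while groups[x] and groups[y]:
            d, i, j = closest_pair(groups[x], groups[y])
            if d >= threshold_m:
                break
            # Remove the object whose group is larger (ties: second group).
            if len(groups[y]) >= len(groups[x]):
                del groups[y][j]
            else:
                del groups[x][i]
    return groups
```

The brute-force closest-pair search is quadratic in group size; for large groups a spatial index would be the natural replacement, which the source does not discuss.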
Step S303: perform random sampling on the multiple data object subgroups to obtain a control data object group and an experimental data object group.
For details, see step S203 of the embodiment shown in FIG. 2; they are not repeated here.
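The details of the sampling step are given elsewhere in the patent. As a rough sketch only, assuming the subgroups are shuffled and split evenly into two halves, the random assignment could look like this; `split_subgroups` and its `seed` parameter are hypothetical.

```python
import random


def split_subgroups(subgroups, seed=None):
    """Randomly assign subgroups to an experimental half and a control half.

    Assumption: an even split of a shuffled list; the source only says the
    subgroups are randomly sampled into the two groups.
    """
    rng = random.Random(seed)
    order = list(range(len(subgroups)))
    rng.shuffle(order)
    half = len(order) // 2
    experimental = [subgroups[i] for i in order[:half]]
    control = [subgroups[i] for i in order[half:]]
    return experimental, control
```

Passing a fixed `seed` makes the assignment reproducible, which helps when the later balancing step needs to be audited.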
Step S304: subject the control data object group and the experimental data object group to the controlled experiment, and after the experiment calculate the difference in the target variable between the two groups at the current observation point.
Specifically, step S304 includes:
Step S3041: obtain the value of the target variable of the control data object group at the current observation point, and the value of the target variable of the experimental data object group at the current observation point.
In an optional implementation, assume that at observation time point $s+1$ after the experimental treatment, the target variable of each group is taken as the sum of the individual target variables within the group (other statistics such as the mean or median may also be used). That is, the following are compared:
$$Y_{A,s+1} = \sum_{i \in A} y_{i,s+1}$$
$$Y_{B,s+1} = \sum_{j \in B} y_{j,s+1}$$
where $A$ is the experimental data object group; $B$ is the control data object group; $y_{i,s+1}$ ($i \in A$) is the value of the target variable of a data object in the experimental group at observation time point $s+1$; $y_{j,s+1}$ ($j \in B$) is the value of the target variable of a data object in the control group at observation time point $s+1$; $Y_{A,s+1}$ is the value of the target variable of the experimental group at observation time point $s+1$; and $Y_{B,s+1}$ is the value of the target variable of the control group at observation time point $s+1$. In the same way, the target variable of each group can be calculated for every historical period $t$ ($t = 1, 2, \ldots, s$):
$$Y_{A,t} = \sum_{i \in A} y_{i,t}$$
$$Y_{B,t} = \sum_{j \in B} y_{j,t}$$
where $A$ is the experimental data object group; $B$ is the control data object group; $y_{i,t}$ ($i \in A$) is the value of the target variable of a data object in the experimental group at observation time point $t$ ($t = 1, 2, \ldots, s$); $y_{j,t}$ ($j \in B$) is the value of the target variable of a data object in the control group at observation time point $t$; $Y_{A,t}$ is the value of the target variable of the experimental group at observation time point $t$; and $Y_{B,t}$ is the value of the target variable of the control group at observation time point $t$.
Step S3042: from the difference between the value of the target variable of the control data object group and that of the experimental data object group at the current observation point, obtain the target variable difference between the two groups at the current observation point.
In an optional implementation, the maximum difference in the target variable between the experimental and control data object groups over all periods is calculated:
$$\Delta Y = \max_{t = 1, 2, \ldots, s} \left| Y_{A,t} - Y_{B,t} \right|$$
If $\Delta Y$ is greater than the preset threshold $\Delta$ (which must be set according to the problem under study), one data object subgroup is drawn from each of the control and experimental data object groups, the two subgroups swap groups, and $\Delta Y$ is recalculated. This repeats until $\Delta Y$ is less than $\Delta$, at which point the control and experimental data object groups are usable for the controlled experiment.
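The balancing loop can be sketched as follows. In this hedged illustration each subgroup is represented only by its list of per-period totals of the target variable, which is an assumed data layout; the names `max_diff` and `balance`, the `seed` parameter and the `max_swaps` cap (added so the loop cannot run forever) are not from the source.

```python
import random


def group_totals(subgroups, periods):
    """Y_t for one group: sum of subgroup totals in each historical period."""
    return [sum(sg[t] for sg in subgroups) for t in range(periods)]


def max_diff(experimental, control):
    """Largest absolute per-period difference between the two groups."""
    periods = len(experimental[0])
    y_a = group_totals(experimental, periods)
    y_b = group_totals(control, periods)
    return max(abs(a - b) for a, b in zip(y_a, y_b))


def balance(experimental, control, delta, seed=0, max_swaps=10000):
    """Swap random subgroups between groups until max_diff is below delta."""
    rng = random.Random(seed)
    experimental, control = list(experimental), list(control)
    for _ in range(max_swaps + 1):
        if max_diff(experimental, control) < delta:
            return experimental, control
        i = rng.randrange(len(experimental))
        j = rng.randrange(len(control))
        # Exchange one subgroup from each side, then re-check.
        experimental[i], control[j] = control[j], experimental[i]
    raise RuntimeError("no balanced split found within max_swaps swaps")
```

Random swapping mirrors the wording of the text; it offers no convergence guarantee, which is why the sketch raises an error once the cap is reached instead of looping indefinitely.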
For example, suppose a city has $m = 10000$ rental listings as experimental objects, and the longitude and latitude $(lng_i, lat_i)$ and historical rental income $y_{i,t}$ of each listing are known. The goal is to select an experimental data object group and a control data object group from these listings, apply two different pricing schemes A and B to them, and observe the difference in rental income $y_{i,s+1}$ between the two groups under the different schemes.
First, $2 \times 200 = 400$ listings can be drawn at random from the $m = 10000$ listings as initial cluster centers, where 200 is the number of business districts that the 10000 listings belong to. The distance from each remaining listing to each cluster center is calculated, each listing is assigned to the class whose center is closest, and the cluster center of each class is updated based on the assignment.
Second, a distance threshold $D = 1500$ meters can be set. Suppose that among the 400 resulting listing groups, groups a and b contain 120 and 150 listings respectively, and the closest pair across the two groups, listing 1 in group a and listing 2 in group b, is 1200 meters apart. Listing 2 is then deleted from group b, because group b contains more listings. The distances between listings in groups a and b are recalculated and listings deleted in this way until the closest distance between the two groups is no less than $D = 1500$ meters.
Finally, a difference threshold $\Delta = 50000$ yuan can be set. If the maximum difference in total monthly rental income between the experimental and control data object groups over the historical months exceeds 50000 yuan, one listing subgroup is drawn from each of the two groups (each containing $n = 200$ listing subgroups) and the two are swapped, repeating until the maximum historical difference does not exceed 50000 yuan. Pricing schemes A and B are then applied to the experimental and control groups respectively, and the total monthly rental income of each group is observed. If the difference is significantly higher than 50000 yuan, or higher than three times the standard deviation of the historical monthly differences in total rental income between the two groups, schemes A and B differ significantly, and the scheme with the higher total rental income is significantly better than the one with the lower total rental income.
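The decision rule at the end of this example can be sketched as a small check. This is an illustration under stated assumptions: the sample standard deviation is used (the text does not say sample or population), "significantly higher" is read as a plain greater-than comparison, and the function name and all figures in the usage note are hypothetical.

```python
import statistics


def schemes_differ(hist_exp, hist_ctl, post_exp, post_ctl, delta):
    """Return True if the post-experiment gap exceeds delta or 3 sigma.

    hist_exp / hist_ctl: total income per historical month for each group.
    post_exp / post_ctl: total income in the month after the experiment.
    """
    hist_diffs = [a - b for a, b in zip(hist_exp, hist_ctl)]
    sigma = statistics.stdev(hist_diffs)  # sample standard deviation (assumed)
    observed = abs(post_exp - post_ctl)
    return observed > delta or observed > 3 * sigma
```

For instance, with made-up historical totals of 1000000, 1010000, 990000 yuan for the experimental group and 990000, 1000000, 1000000 yuan for the control group, a post-experiment gap of 40000 yuan would be flagged by the three-sigma branch even though it is below the 50000 yuan threshold.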
Step S305: when the target variable difference is greater than or equal to the difference threshold, swap some of the data objects between the control data object group and the experimental data object group.
For details, see step S205 of the embodiment shown in FIG. 2; they are not repeated here.
The grouping method for data objects provided in this embodiment uses spatial clustering on geographic coordinate data to divide the candidate data objects into multiple data object groups, which provides the data basis for the later division into data object subgroups. Based on the inter-group distance, data objects are removed from any two of these groups to obtain multiple data object subgroups that are spatially uncorrelated with one another. Random sampling of the subgroups yields the control data object group and the experimental data object group. The two groups are subjected to the controlled experiment, and after the experiment the target variable difference between them at the current observation point is calculated. When this difference is greater than or equal to the difference threshold, some of the data objects in the two groups are swapped to reduce the difference between the groups in the controlled experiment. This eliminates the mutual influence between data of objects participating in the experiment and data of objects not participating, and controls the difference in the historical data of the target variable between the two groups, so that a "control group" and an "experimental group" made up of highly heterogeneous individuals are comparable.
This embodiment also provides a grouping device for data objects. The device implements the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that realizes a predetermined function. Although the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
This embodiment provides a grouping device for data objects which, as shown in FIG. 4, includes:
A data object division module 401, configured to divide the candidate data objects into multiple data object groups by spatial clustering based on geographic coordinate data.
In some optional implementations, the data object division module 401 includes:
An object extraction unit, configured to randomly draw multiple data objects from the candidate data objects as initial cluster centers.
An object division unit, configured to calculate the distance from each remaining candidate data object to each initial cluster center and to assign the candidate data objects to the classes of the corresponding initial cluster centers.
Specifically, first obtain the longitude and latitude of each candidate data object, the longitude and latitude of each initial cluster center, and the Earth's radius. Then create a clustering distance calculation model from the spherical distance function, these coordinates, and the Earth's radius. Finally, use the model to calculate the distance from each remaining candidate data object to each initial cluster center.
A cluster update unit, configured to update the cluster center of each class based on the division result to form multiple data object groups.
Specifically, first sum the longitudes of the candidate data objects in the current class and divide by the number of candidate data objects in the class to obtain the longitude of the class's cluster center. Then do the same with the latitudes to obtain the latitude of the cluster center. Finally, update the cluster center of the current class to this longitude and latitude.
A data object removal module 402, configured to remove data objects from any two of the multiple data object groups based on the inter-group distance, to obtain multiple data object subgroups that are spatially uncorrelated with one another.
In some optional implementations, the data object removal module 402 includes:
A first data acquisition unit, configured to obtain the longitude and latitude of each data object in the first compared data object group, the longitude and latitude of each data object in the second compared data object group, and the Earth's radius;
A second data acquisition unit, configured to create an inter-group distance calculation model from the spherical distance function, the longitude and latitude of each data object in the first compared data object group, the longitude and latitude of each data object in the second compared data object group, and the Earth's radius;
A distance calculation unit, configured to use the inter-group distance calculation model to calculate the distance between the two closest data objects across the first and second compared data object groups;
An object deletion unit, configured to delete one of the two closest data objects across the first and second compared data object groups when the distance between them is less than the distance threshold.
A data object sampling module 403, configured to perform random sampling on the multiple data object subgroups to obtain a control data object group and an experimental data object group.
A data object experiment module 404, configured to subject the control data object group and the experimental data object group to the controlled experiment and, after the experiment, to calculate the target variable difference between the two groups at the current observation point.
In some optional implementations, the data object experiment module 404 includes:
An observation point value acquisition unit, configured to obtain the value of the target variable of the control data object group at the current observation point and the value of the target variable of the experimental data object group at the current observation point;
An observation point difference calculation unit, configured to obtain the target variable difference between the control and experimental data object groups at the current observation point from the difference between their target variable values at that point.
A data object swap module 405, configured to swap some of the data objects between the control data object group and the experimental data object group when the target variable difference is greater than or equal to the difference threshold, and then to run the controlled experiment on the control and experimental data object groups after the swap.
The grouping device for data objects in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory that execute one or more software or firmware programs, and/or other devices that can provide the above functions.
Further functional descriptions of the above modules and units are the same as in the corresponding embodiments above and are not repeated here.
An embodiment of the present invention also provides a computer device that has the grouping device for data objects shown in FIG. 4.
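Referring to FIG. 5, which is a schematic structural diagram of a computer device provided by an optional embodiment of the present invention, the computer device includes one or more processors 10, a memory 20, and interfaces for connecting the components, including high-speed and low-speed interfaces. The components communicate with one another over different buses and may be mounted on a common motherboard or in other ways as required. The processor can process instructions executed within the computer device, including instructions stored in or on the memory for displaying graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In some optional implementations, multiple processors and/or multiple buses may be used together with multiple memories if required. Likewise, multiple computer devices may be connected, each providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multiprocessor system). FIG. 5 takes one processor 10 as an example.
The processor 10 may be a central processing unit, a network processor, or a combination thereof. The processor 10 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field-programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10, so that the at least one processor 10 performs the method shown in the above embodiments.
The memory 20 may include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function; the data storage area may store data created through use of the computer device for displaying a mini-program landing page, and the like. In addition, the memory 20 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some optional implementations, the memory 20 may optionally include memory located remotely from the processor 10, and such remote memory may be connected to the computer device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 20 may include volatile memory, for example random access memory; it may also include non-volatile memory, for example flash memory, a hard disk, or a solid-state drive; and it may include a combination of the above kinds of memory.
The computer device also includes a communication interface 30 for communication between the computer device and other devices or communication networks.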
An embodiment of the present invention also provides a computer-readable storage medium. The method according to the embodiments of the present invention may be implemented in hardware or firmware, as computer code recordable on a storage medium, or as computer code that is originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network, and stored on a local storage medium, so that the method described here can be carried out by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disc, read-only memory, random access memory, flash memory, a hard disk, a solid-state drive, or the like; the storage medium may also include a combination of the above kinds of memory. It will be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component that can store or receive software or computer code, and that the method shown in the above embodiments is implemented when the software or computer code is accessed and executed by the computer, processor, or hardware.
Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and all such modifications and variations fall within the scope defined by the appended claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310403129.6A CN116401566A (en) | 2023-04-14 | 2023-04-14 | Grouping method, grouping device, computer equipment and storage medium for data objects |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116401566A true CN116401566A (en) | 2023-07-07 |
Family
ID=87012163
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310403129.6A Pending CN116401566A (en) | 2023-04-14 | 2023-04-14 | Grouping method, grouping device, computer equipment and storage medium for data objects |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116401566A (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140221022A1 (en) * | 2013-02-06 | 2014-08-07 | Andrea Vaccari | Grouping Ambient-Location Updates |
| CN112148942A (en) * | 2019-06-27 | 2020-12-29 | 北京达佳互联信息技术有限公司 | Business index data classification method and device based on data clustering |
| CN112733890A (en) * | 2020-12-28 | 2021-04-30 | 北京航空航天大学 | Online vehicle track clustering method considering space-time characteristics |
| CN113988185A (en) * | 2021-10-29 | 2022-01-28 | 平安科技(深圳)有限公司 | Data processing method and related device |
| CN114218997A (en) * | 2021-10-29 | 2022-03-22 | 中国建设银行股份有限公司 | Experimental data grouping method, device, medium and electronic equipment |
| CN115730126A (en) * | 2021-08-26 | 2023-03-03 | 腾讯科技(深圳)有限公司 | Data processing method and device, computer readable storage medium and computer equipment |
-
2023
- 2023-04-14 CN CN202310403129.6A patent/CN116401566A/en active Pending
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||