WO2023151789A1 - Method for verifying an approval of a data set of a data collection job performed by a data collection system, data collection system, data supply unit, server and release verification device - Google Patents
- Publication number: WO2023151789A1 (PCT/EP2022/053145)
- Authority: WIPO (PCT)
- Prior art keywords
- data
- data collection
- collection system
- data set
- collected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Definitions
- the invention is concerned with a method for verifying an approval of a data set of a data collection job performed by a data collection system, a data collection system, a data supply unit, a server and a release verification device.
- In order to improve devices of vehicles it is necessary to collect observation data provided by vehicles in the field while the vehicles are operated by drivers. An analysis of the observation data collected by the vehicles allows an identification of errors and deficiencies in current devices of the vehicles. To obtain observation data for analysis, it is common to record data of vehicles operated by test drivers. A deeper analysis needs a certain amount of the observation data that cannot be provided by test drivers only. It is therefore necessary to include other sources of observation data in the data collection.
- the connection of vehicles via mobile internet allows the gathering of the observation data provided by vehicles of customers to increase the amount of collected observation data related to several driving conditions.
- the involvement of observation data of customers results in the risk of an identification of single drivers providing observation data to the system.
- An identification of a single driver is an intrusion into the privacy of the respective driver because it allows a tracking of his habits. It is therefore necessary to perform a proper risk assessment in order to determine the risk of an identification of a single user.
- the risk of an identification depends on several factors like the number of participants providing observation data, the accuracy of the data and the amount of the data. Unfortunately, it is not possible to determine the risk of an identification before the observation data are collected by a data collection system.
- US 2016/0164922 A1 describes a computer-implemented method for managing an authentication policy for a user on a network.
- An authentication policy management system assesses individual user attributes and generates a risk value for each of the attributes for a user.
- US 2020/0042723 A1 describes an identity fraud risk assessment platform.
- the risk assessment platform determines a level of risk of identity fraud associated with a user based on first and second user and device attributes.
- the risk assessment platform grants or denies the user access to a second protected resource based on the determined level of risk of identity fraud associated with the user.
- US 2021/0089680 A1 describes a policy driven data movement for moving personal and sensitive data from a source filesystem to a destination filesystem while enforcing a source privacy legal framework.
- the invention comprises a method for verifying an approval of a data set of a data collection job performed by a data collection system.
- At least one observation data element of the data set is provided by data supply units of the data collection system while the data supply units are operated by users.
- the at least one observation data element of the data set is collected in the data set by a server of the data collection system.
- the method is related to a collection of the data set, wherein the data set comprises at least one observation data element.
- the at least one observation data element is provided by the data supply units of the data collection system.
- the at least one observation data element is added to the data set by the server of the data collection system. As the collection of the at least one observation data element may violate the privacy of the users it is necessary to check the data collection job before it is performed.
- a first step of the method comprises a reception of a data collection request by a release verification device.
- the data collection request comprises a utility policy defining the at least one observation data element of the data set and a predefined accuracy of the at least one observation data element of the data set to be collected by the data collection system.
- the release verification device receives the data collection request, which requests the performance of the data collection job by the data collection system.
- the data collection request comprises the utility policy.
- the release verification device may be designed as a computer configured to determine whether the data collection job requested in the data collection request is to be approved for the data collection system.
- the data collection request may be a message sent by a computer of a data analysis device to the release verification device.
- the utility policy may be designed as a configuration file listing the one or more observation data to be collected by the data supply units and respective qualities that may define an accuracy of the respective observation data.
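As an illustration, a utility policy of this kind could be represented as a small configuration structure. The following Python sketch is purely hypothetical; the field names (`signal`, `accuracy`, `job_id`) and values are assumptions for illustration, not taken from the publication.

```python
# Hypothetical utility policy: lists the observation data elements to be
# collected and the accuracy requested for each of them.  All field names
# and values below are illustrative assumptions.
utility_policy = {
    "job_id": "JOB-001",
    "observation_data": [
        {"signal": "vehicle_speed", "accuracy": "5 km/h"},
        {"signal": "position", "accuracy": "1 km grid"},
        {"signal": "outside_temperature", "accuracy": "1 degC"},
    ],
}

def requested_signals(policy):
    """Return the names of the observation data elements a policy requests."""
    return [entry["signal"] for entry in policy["observation_data"]]
```

A release verification device could iterate over such entries to look up the privacy weight of each requested element.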
- a second step of the method comprises a calculation of a degree of a personal reference of the data set to be collected by the data collection system according to a predefined a priori estimation method by the release verification device.
- the release verification device calculates the degree of the personal reference of the data set that has to be provided by the data collection system.
- the calculation is performed according to the predefined a priori estimation method.
- the estimation may be based on a predicted data set.
- the degree of a personal reference may describe the risk of identification of a single user or the risk that a specific user may be assigned to a specific data supply unit based on the at least one observation data element of the data set.
- a third step of the method comprises a release of the data collection request for the data collection system by the release verification device in case the degree of the personal reference of the data set to be collected fulfills a predefined a priori condition.
- the data collection request is forwarded to the data collection system by the release verification device so that the data collection job may be performed by the data collection system according to the data collection request.
- the release will be performed if the degree of the personal reference of the data set to be collected, that was calculated according to the predefined a priori estimation method, fulfills the predefined a priori condition.
- the a priori condition is related to a reduced risk of assignability of one of the data supply units to a particular user based on the data set to be collected.
- the reduced risk may be predefined and evaluated using privacy models of the state of the art. In other words, it is necessary that the personal reference satisfies the a priori condition in order for the data collection job to be executed by the data collection system.
- the a priori condition is designed to limit the risk that one of the data supply units may be assigned to one of the users by analyzing the data set.
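The a priori release decision described above can be sketched as a simple threshold check. Interpreting the degree on a numeric scale where a lower value means a lower re-identification risk is an assumption made for this sketch.

```python
def release_request(degree_of_personal_reference, a_priori_limit):
    """A priori release check: the data collection request is forwarded to
    the data collection system only if the estimated degree of personal
    reference stays within the predefined limit.  Treating a lower degree
    as a lower risk of assignability is an assumption of this sketch."""
    return degree_of_personal_reference <= a_priori_limit
```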
- a fourth step comprises an execution of the data collection job by the data collection system according to the utility policy of the data collection request and a transmission of the collected data set to the release verification device.
- the data collection system performs the data collection job to provide the data set.
- the data collection job is performed according to the utility policy of the data collection request in order to provide the data set comprising the at least one observation data element of the predefined accuracy.
- the execution of the data collection job may comprise a measurement of the at least one observation data element of the predefined accuracy by the data supply units and the transmission of the at least one observation data element to a server of the data collection system by the data supply units, wherein the server creates the collected data set by combining the observation data provided by the data supply units.
- the server transmits the data set to the release verification device.
- a fifth step comprises a calculation of a degree of a personal reference of the data set collected by the data collection system according to a predefined a posteriori estimation method by the release verification device.
- the release verification device calculates the degree of the personal reference of the data set that was created during the data collection job.
- the personal reference is calculated according to the predefined a posteriori estimation method that may be different from the a priori estimation method.
- the personal reference is calculated twice.
- a first calculation is made before the data collection job is performed according to the a priori estimation method. The first calculation is based on expected observation data of the data set.
- the second calculation is made after the performance of the data collection job, wherein the second calculation is performed according to the a posteriori estimation method. The second calculation is based on the data set that was collected during the data collection job.
- in case the degree of the personal reference of the data set collected by the data collection system fulfills a predefined a posteriori condition, the data set collected by the data collection system is released by the release verification device.
- the a posteriori condition is related to a reduced risk of assignability of one of the data supply units to a particular user based on the data set collected by the data collection system.
- the invention has the advantage that privacy protection is considered in two verification steps.
- the first verification step is performed before the execution of the data collection job - based on an expected data base (synthetic, simulated, or real data from a different job) and the required quality defined in the utility policy - in order to estimate, whether it has to be expected that the resulting real data set needs to be considered as personal related or not, resulting in adapted protection measures.
- the second verification step is performed after the execution of the data collection job. In the second verification, it may be verified whether the privacy policy is fulfilled with the required quality measures defined in the utility policy applied on the real data.
- the risk of identification is related to a predefined technically reasonable effort, which is specified in the privacy policy.
- the reasonable effort is dependent on implemented technical and organizational security measures as well as applied anonymization techniques such as k-anonymity.
- the invention also comprises embodiments that provide features which afford additional technical advantages.
- the degree of the personal reference is based on k-anonymity, l-diversity, t-closeness or differential privacy.
- the degree of the personal reference is given in k-anonymity, l-diversity, t-closeness or differential privacy. Details about the aforementioned procedures can be found in Domingo-Ferrer, Josep, and Jordi Soria-Comas. "From t-closeness to differential privacy and vice versa in data anonymization." Knowledge-Based Systems 74 (2015): 151-158. K-anonymity, l-diversity, t-closeness or differential privacy are examples to determine the degree of the personal reference. It is also possible to use other metrics known from prior art.
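As a minimal illustration of one of these metrics, k-anonymity of a table is the size of the smallest group of records sharing the same quasi-identifier values. The column names and sample records below are assumptions for this sketch.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """k-anonymity of a table: the size of the smallest equivalence class
    over the quasi-identifier columns.  Each record is then
    indistinguishable from at least k - 1 other records."""
    classes = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(classes.values())

# Four generalized records; each (zip, age) combination occurs twice,
# so the table is 2-anonymous over (zip, age).
records = [
    {"zip": "10*", "age": "20-30", "speed": 52},
    {"zip": "10*", "age": "20-30", "speed": 48},
    {"zip": "10*", "age": "30-40", "speed": 61},
    {"zip": "10*", "age": "30-40", "speed": 66},
]
```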
- the embodiment has the advantage that the degree of the personal reference is given as a quantity or property which is common or widespread in the state of the art.
- the degree of the personal reference of the data set to be collected is calculated by the release verification device using the predefined a priori estimation method comprising the step of reading out a weight value assigned to the at least one observation data element of the data set of the respective accuracy of the at least one observation data element from a database, wherein the weight value describes the specific degree of the personal reference of the at least one observation data element.
- the release verification device comprises a database, wherein the database assigns a weight value to each observation data of each accuracy.
- the weight value is related to the personal reference of the single observation data of the respective accuracy alone.
- the release verification device may read out every weight value saved in the database of the observation data to be provided in the data set.
- the release verification device performs a combination of the weight values of the at least one observation data element of the data set to be collected according to a predefined combination method to get the degree of the personal reference of the data set to be collected.
- the next step is related to the combination of the weight values in order to get the personal reference of the data set comprising the observation data.
- the combination is performed by the release verification device according to the predefined combination method.
- the predefined combination method may determine the combination of the single weight values respecting interferences between the observation data.
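The lookup and combination described above might look as follows. The weight values and the combination rule (treating weights like independent risk probabilities) are assumptions for this sketch, not the predefined combination method of the publication.

```python
# Hypothetical weight database: per (observation data, accuracy) pair, a
# weight describing the degree of personal reference of that element alone.
WEIGHT_DB = {
    ("position", "raw"): 0.9,
    ("position", "1 km grid"): 0.3,
    ("vehicle_speed", "5 km/h"): 0.1,
}

def combined_degree(requested_elements, weight_db):
    """Combine the per-element weights into one degree for the data set.
    The rule 1 - prod(1 - w) models that several weakly identifying
    elements together raise the overall risk of assignability."""
    degree = 0.0
    for signal, accuracy in requested_elements:
        w = weight_db[(signal, accuracy)]
        degree = 1.0 - (1.0 - degree) * (1.0 - w)
    return degree
```

A real combination method would additionally have to respect interferences between the observation data, which this sketch ignores.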
- the execution of the data collection job by the collection system comprises predefined anonymization steps defined in a privacy policy saved in the data collection system.
- Anonymization can be achieved by data manipulation such as generalization, synthetization, noise addition or other known methods.
- the data collection system performs predefined anonymization steps during the data collection job in order to anonymize the data set. Therefore, it is possible to reduce the risk of assignability of the data set.
- the predefined anonymization steps and parameters used in the predefined anonymization steps are defined in the privacy policy of the data collection system.
- the privacy policy may be a configuration file that may be saved on a server of the data collection system and/or the data collection units.
- the predefined anonymization steps comprise a generalization of the at least one observation data element by the data supply units after measurement of the at least one observation data element to achieve the predefined accuracy of the at least one observation data element according to the privacy policy.
- the at least one observation data element may be measured by the data supply units in an accuracy that is higher than the accuracy requested according to the utility policy.
- the data supply units may reduce the accuracy of the at least one observation data element by anonymization and/or generalization of the data in order to limit the risk of assignability.
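Such an on-device generalization step could, for instance, snap a measured GPS fix to a coarser grid. The grid size and the rounding scheme are assumed parameters for this sketch.

```python
def generalize_position(lat, lon, grid_deg=0.01):
    """Reduce the accuracy of a measured position by snapping it to a grid
    of roughly 0.01 degrees (about 1 km), so the exact location of the
    data supply unit is no longer contained in the observation data."""
    snap = lambda value: round(round(value / grid_deg) * grid_deg, 6)
    return snap(lat), snap(lon)
```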
- the predefined anonymization steps comprise a relaying of the at least one observation data element from the data supply units to the server via a relay device of the data collection system, wherein transmission data and/or identification data of the data supply units provided by the supply units are removed by the relay device before the at least one observation data element is forwarded to the server.
- the observation data provided by the data supply units is not sent to the server directly but via the relay device.
- the observation data itself may not comprise data suitable for an identification of a single data supply unit.
- the transmission data of the transmission of the observation data are provided to the server by the data supply units. These transmission data may allow a link of the observation data to the data supply unit providing the observation data.
- the transmission data may comprise an IP-address or other identification data of the data supply units. Therefore, it is intended to transmit the observation data via the relay device in order to hide the respective data supply units.
- the relay device may be a computer or an overlay network like TOR that may remove transmission data and/or identification data of the data supply units.
- the relay device sends the observation data to the server without the transmission data and/or identification data provided by the data supply units.
- the server may receive transmission data and/or identification data of the relay device, but not the transmission data and/or identification data of the data supply units.
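A minimal sketch of such a relay step, assuming hypothetical metadata field names, could look like this:

```python
def relay_forward(message):
    """Relay sketch: remove transmission and identification metadata of the
    data supply unit before forwarding the observation data to the server.
    The field names stripped here are illustrative assumptions."""
    metadata_fields = {"sender_ip", "device_id", "vin"}
    return {key: value for key, value in message.items()
            if key not in metadata_fields}

incoming = {
    "sender_ip": "203.0.113.7",   # would identify the data supply unit
    "device_id": "DSU-42",
    "payload": {"vehicle_speed": 50},
}
```

The server then only sees the relay's own transmission data, not that of the data supply unit.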
- the predefined anonymization steps comprise an anonymization and/or reduction of the accuracy of the at least one observation data element by the server of the data collection system according to the privacy policy.
- the observation data are anonymized or reduced by the server according to the privacy policy.
- the embodiment has the advantage, that a centralized anonymization or generalization of the data set may be performed by the server after the reception of the observation data of the data supply units.
- the data collection request comprises an auxiliary utility policy defining the at least one observation data element of the data set and an auxiliary accuracy of the at least one observation data element of the data set to be collected by the data collection system.
- the auxiliary utility policy may be used by the collection system in case the degree of the personal reference of the data set collected by the data collection system according to the utility policy does not fulfill the predefined a posteriori condition.
- the data collection request comprises the utility policy and the auxiliary utility policy.
- the utility policy may define the at least one observation data element of the data set and the accuracy of the at least one observation data element of the data set that has to be provided by the data collection job.
- the utility policy may define a preferred configuration of the data set.
- the auxiliary utility policy of the collection request may be used instead.
- the auxiliary utility policy describes the details of the data collection job that may provide a data set that may comply with the conditions and/or the privacy policy.
- a calculation of a degree of a personal reference of the data set collected by the data collection system according to the auxiliary utility policy is performed by the release verification device according to the predefined a posteriori estimation method.
- the release of the data set collected by the data collection system according to the auxiliary utility policy by the release verification device may be performed in case the degree of the personal reference of the data set collected by the data collection system according to the auxiliary utility policy fulfills the predefined a posteriori condition.
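The fallback between the two policies can be sketched as follows; the numeric interpretation of the degree and the return values are assumptions for illustration.

```python
def select_release(primary_degree, auxiliary_degree, a_posteriori_limit):
    """Release the data set collected under the utility policy if its
    a posteriori degree fulfills the condition; otherwise fall back to the
    data set collected under the auxiliary utility policy; release nothing
    if neither fulfills the condition."""
    if primary_degree <= a_posteriori_limit:
        return "utility_policy"
    if auxiliary_degree <= a_posteriori_limit:
        return "auxiliary_utility_policy"
    return None
```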
- the method comprises steps of comparison of the degree of the personal reference of the data set to be collected by the data collection system according to the predefined a priori estimation method with the degree of the personal reference of the data set collected by the data collection system according to the predefined a posteriori estimation method by the release verification device.
- a further step comprises an adaption of measures defining organizational and technical protection of the data sets as a whole according to a predefined adaption method.
- the release verification device compares the personal reference calculated before the data collection job with the personal reference calculated after the data collection job.
- the release verification device adapts the measures according to a predefined adaption method depending on the difference between the personal reference calculated before the data collection job and the personal reference calculated after the data collection job.
- the release verification device may change the measures defining the organizational and technical protection steps of the data sets as a whole according to the predefined adaption method depending on the difference between the personal references.
- the embodiment has the advantage, that the release verification device may evaluate the measures in order to reduce a difference between the personal references calculated before and after the data collection job. Therefore, the personal reference calculated according to the a priori estimation method may be more reliable in further analysis.
- the method comprises an adaption of the predefined a priori estimation method by the release verification device depending on the difference between the personal reference calculated before the data collection and the personal reference calculated after the data collection.
- the release verification device may fit parameters used in the predefined a priori estimation method, in order to reduce the difference between the personal references.
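One very simple instance of such a parameter fit is a proportional update that moves a scaling factor of the a priori estimator toward the ratio actually observed. The update rule and the learning rate are assumptions for this sketch.

```python
def adapt_estimator_factor(predicted, observed, factor, rate=0.5):
    """Nudge a scaling factor of the a priori estimation method so that
    future predictions move toward the degree observed a posteriori.
    A plain proportional update; the rate of 0.5 is illustrative."""
    if predicted == 0:
        return factor
    return factor * (1.0 + rate * (observed / predicted - 1.0))
```

Repeated over many data collection jobs, such an update would shrink the gap between the a priori and a posteriori degrees.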
- the invention comprises a data collection system configured to execute a data collection job to collect at least one observation data element of a predefined accuracy of a data set to be collected according to a utility policy of a data collection request, and to transmit the collected data set to a release verification device.
- the data collection system comprises data supply units configured to provide the at least one observation data element of the data set while the data supply units are operated by users and a server configured to collect the at least one observation data element in the data set.
- the invention comprises a data supply unit of a data collection system, configured to provide at least one observation data element of a predefined accuracy of a data set to be collected according to a utility policy of a data collection request while the data supply unit is operated by a user.
- the invention comprises a server of a data collection system, configured to collect at least one observation data element in a data set.
- the invention comprises a release verification device.
- the release verification device is configured to receive a data collection request, wherein the data collection request comprises a utility policy, defining the at least one observation data element of the data set and a predefined accuracy of the at least one observation data element of the data set to be collected by the data collection system.
- the release verification device is configured to calculate a degree of a personal reference of the data set to be collected by the data collection system according to a predefined a priori estimation method.
- the release verification device is configured to release the data collection request for the data collection system, in case the degree of the personal reference of the data set to be collected fulfills a predefined a priori condition, wherein the a priori condition is related to a reduced risk of assignability of one of the data supply units of the data collection system to a particular user based on the data set to be collected.
- the release verification device is configured to receive the collected data set from the data collection system and to calculate a degree of a personal reference of the data set collected by the data collection system according to a predefined a posteriori estimation method.
- the release verification device is configured to release the data set collected by the data collection system, in case the degree of the personal reference of the data set collected by the data collection system fulfills a predefined a posteriori condition, wherein the a posteriori condition is related to a reduced risk of assignability of one of the data supply units to a particular user based on the data set collected by the data collection system.
- the invention also comprises embodiments of the inventive data collection system, the inventive data supply unit, the inventive server, the inventive release verification device that comprise features that correspond to features as they have already been described in connection with the embodiments of the inventive method. For this reason, the corresponding features of the embodiments of the inventive data collection system, the inventive data supply unit, the inventive server, the inventive release verification device are not described here again.
- the inventive data collection system, the inventive data supply unit, the inventive server, the inventive release verification device may comprise a data processing device or a processor circuit adapted to perform an embodiment of the method according to the invention.
- the processor circuit may comprise at least one microprocessor and/or at least one microcontroller and/or at least one FPGA (field programmable gate array) and/or at least one DSP (digital signal processor).
- the processor circuit may comprise program code which comprises computer-readable instructions to perform the embodiment of the method according to the invention when executed by the processor device.
- the program code may be stored in a data memory of the processor device.
- the inventive data supply unit may be designed as a device for a vehicle that is preferably designed as a motor vehicle, in particular as a passenger vehicle or a truck, or as a bus or a motorcycle.
- the invention also comprises the combinations of the features of the different embodiments.
- the embodiment explained in the following is a preferred embodiment of the invention.
- the described components of the embodiment each represent individual features of the invention which are to be considered independently of each other and which each develop the invention also independently of each other and thereby are also to be regarded as a component of the invention in individual manner or in another than the shown combination.
- the described embodiment can also be supplemented by further features of the invention already described.
- Fig. a schematic illustration of a method for verifying an approval of a data set of a data collection job.
- FIG. shows a schematic illustration of a method for verifying an approval of a data set of a data collection job.
- a first step P1 may comprise a reception of a data collection request 6 by a release verification device 3.
- the data collection request 6 may be sent by a data processing device 2 in order to initiate a data collection job to be performed by the data collection system 1.
- the data collection request 6 may depend on a use case and may define data to be collected by the data collection system 1 for the data processing device 2.
- the data collection system 1 may comprise data supply units 4 that may be arranged in vehicles and a server 5.
- the data collection request 6 may comprise a utility policy 7 defining a data set 9 to be provided by the data collection system 1 during the data collection job.
- the utility policy 7 may define observation data 8 of the data set 9 and an accuracy 10 of each of the observation data 8.
- the utility policy 7 may define a preferred data set 9.
- the data collection request 6 may also comprise an auxiliary utility policy 11 describing an auxiliary data set 13, wherein the auxiliary data set 13 comprises auxiliary observation data elements 12 of an auxiliary accuracy 14.
- the auxiliary utility policy 11 may define an auxiliary data set 13, that may be provided by the data collection system 1 in case the data set 9 according to the utility policy 7 is not approved by the release verification device 3.
- the release verification device 3 may perform a calculation of a degree of personal reference 17 of the data set 9 to be collected by the data collection system 1.
- the calculation may be performed according to a predefined a priori estimation method P2.
- the a priori estimation method may utilize a database 20, wherein the database 20 comprises weight values 21 that are assigned to the observation data 8 of the data set 9 of the respective accuracies 10.
- the weight values 21 may describe a degree of personal reference of the single observation data 8 of the respective accuracy.
- the release verification device 3 may combine the weight values 21 of the at least one observation data element 8 of the data set 9 according to a predefined combination method, to get the degree of the personal reference of the data set 9 to be collected.
- the personal reference of the data set 9 may define a risk to identify single user of the data collection system 1 based on the data set 9.
- the degree of the personal reference of the data set 9 to be collected has to fulfil a predefined a priori condition 23.
- the predefined a priori condition 23 may define a maximum value of the degree of the personal reference of the data set 9.
- in case the degree of personal reference 17 of the data collection job according to the utility policy 7 fulfils the predefined a priori condition 23, the data collection job will be performed according to the utility policy 7 by the data collection system 1. Otherwise, the release verification device 3 may calculate the personal reference of the data set 9 to be collected by the data collection system 1 according to the auxiliary utility policy 11, again using the predefined a priori estimation method. If the degree of personal reference 17 calculated for the auxiliary utility policy 11 satisfies the a priori condition 23, the data collection job will be performed according to the auxiliary utility policy 11.
- the data collection job may comprise a collection of the at least one observation data element 8 of the data set 9 by the data supply units 4 of the data collection system 1 P3.
- the data collection system 1 may comprise a privacy policy 15.
- the privacy policy 15 may define several anonymization steps S1, S2, S3 to anonymize the data set 9.
- the anonymization steps S1, S2, S3 may define a manipulation of the data and the data set 9 to reduce the degree of personal reference 17, 19.
- the privacy policy 15 can be a predefined set of rules for the system. For example, it may have been defined by a data protection officer to ensure compliance with data protection standards.
- the privacy policy 15 may contain specifications on the measures that can be used, such as generalization, or on the metric to be used to measure the reference to a person.
- the privacy policy 15 may not only specify which anonymization steps S1, S2, S3 are used, but can also provide definitions and guidelines that may specify under which conditions the data set 9 is considered "anonymous" or "secure enough". For example, the privacy policy 15 may specify that the data set 9 is considered anonymous when a k-anonymity of 5 is achieved.
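The kind of rule set described above could be sketched like this. All field names are illustrative assumptions; only the k >= 5 threshold mirrors the example in the text.

```python
# Hypothetical privacy policy as a rule set.  The k-anonymity threshold of 5
# mirrors the example above; the structure itself is an assumption.
privacy_policy = {
    "anonymization_steps": ["generalize_on_device", "relay", "server_side"],
    "metric": "k-anonymity",
    "anonymous_if_k_at_least": 5,
}

def data_set_is_anonymous(policy, achieved_k):
    """Check whether a collected data set meets the policy's definition
    of 'anonymous' for the k-anonymity metric."""
    return achieved_k >= policy["anonymous_if_k_at_least"]
```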
- the privacy policy 15 may comprise a set of rules of/for the release verification device 3. In other words, the release verification device may use the specifications of the privacy policy 15 to verify the data collection request 6 and to decide whether the data set 9 can be released directly or whether additional measures like the anonymization steps S1 , S2, S3 are required.
- the anonymization steps may comprise a step S1 of generalization or anonymization of the observation data 8 collected by the data supply units 4 performed by each of the data supply units 4 to reduce the accuracy of the at least one observation data element 8.
- the data supply units 4 may transmit the observation data 8 to a server 5 of the data collection system 1 (step P4).
- the step of transmission may comprise a transmission of the observation data 8 to a relay device 16 of the data collection system 1 (step S2).
- the relay device 16 may transmit the observation data 8 to the server 5. Therefore, no direct connection between the data supply units 4 and the server 5 is established. Consequently, it is not possible to assign the observation data 8 to a single data supply unit 4 by analysing transmission or identification data provided during the transmission.
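The role of the relay device 16 can be illustrated by the following sketch; the message fields are purely hypothetical and not taken from the description:

```python
def relay_forward(message):
    # forward only the observation payload; transmission data and
    # identification data of the data supply unit are not copied
    return {"payload": message["payload"]}

incoming = {"src_ip": "203.0.113.7", "unit_id": "DSU-42",
            "payload": {"speed_kmh": 87}}
forwarded = relay_forward(incoming)  # the server never sees src_ip or unit_id
```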
- the server 5 may receive the observation data 8 and may add them to the data set 9. To ensure privacy, the server 5 may perform generalization or anonymization steps S3 to reduce the accuracy of the observation data 8 in the data set 9.
- since the data set 9 collected by the data collection system 1 may be different from the expected data set 9, the data set 9 may be provided to the release verification device 3 in a step P6 in order to calculate the degree of personal reference 17, 19 of the data set 9 collected by the data collection system 1 according to a predefined a posteriori estimation method.
- the a posteriori estimation method may be different from the a priori estimation method used before performing the job.
- the degree of personal reference 17, 19 of the data set 9 must fulfil the a posteriori condition 18.
- the data set 9 may be sent to the data processing device 2, which may analyse the data set 9 (step P7).
- data protection measures 22 for protecting the data set 9 may be tightened by the data collection system 1 in order to reduce the degree of personal reference 19 of the data set 9 or to limit access to the data set 9.
- the server 5 of the data collection system 1 may repeat the generalization and/or anonymization step S3 in order to reduce the accuracy 10 of the observation data 8 when the privacy policy 15 is changed, wherein the accuracy 10 is set to a lower value.
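Such a repeated generalization with decreasing accuracy can be sketched as follows; the bucket widths and the k-anonymity target are assumptions made only for illustration:

```python
from collections import Counter

def coarsen(values, width):
    # generalization step S3: round values down to buckets of the given width
    return [(v // width) * width for v in values]

def tighten_until_anonymous(values, k_required, widths=(1, 5, 10, 20)):
    # repeat the generalization with ever lower accuracy until every
    # bucket holds at least k_required values; None signals that even
    # the coarsest setting fails and the data set may be discarded
    for width in widths:
        coarse = coarsen(values, width)
        if min(Counter(coarse).values()) >= k_required:
            return coarse, width
    return None, None
```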
- measures 22 of a data protection device for the organizational and technical protection of data records as a whole, like for example authorized access, encryption and employee sensitization, may be enforced more stringently by the system 1. This may include, for example, a stronger restriction of access to the data set 9 to smaller groups of people and a stronger read protection of the data set 9.
- a collection of a larger volume of data by the data collection system 1 may be arranged, which may also be processed in steps S1 -S3.
- One option may provide that the data set is discarded if, for example, it is foreseeable that compliance with specifications regarding the reference to persons cannot be maintained.
- steps S1-S3 may be performed according to the privacy policy 15.
- the application of steps S1-S3 may lead to the fulfillment of the requirements regarding personal reference. However, it may happen that the steps S1-S3 are not sufficient to fulfill the requirements. In this case, the release verification device 3 may wait for more data, adjust the measures 22, or refuse to release the data set 9.
- the measures 22 may be such that the protection measures depend on the degree of personal reference of the data set. If the data set has no or a predetermined low degree of personal reference, fewer restrictions on access to the data set may be imposed by the measures. If the data set has a predetermined higher degree of personal reference, stronger restrictions regarding data access to the data set may be specified by the measures. This may include, for example, that fewer persons are authorized to access the data set and a stricter access protection to the data set is specified.
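A possible mapping from the degree of personal reference to such graduated access restrictions might look as follows; the thresholds and measure names are illustrative assumptions, not values taken from the description:

```python
def access_measures(degree):
    # map the degree of personal reference of a data set to
    # organizational protection measures 22 (illustrative tiers)
    if degree < 1.0:    # no or low personal reference
        return {"authorized": "all analysts", "read_protection": "standard"}
    if degree < 2.5:    # medium personal reference
        return {"authorized": "project team", "read_protection": "encrypted"}
    return {"authorized": "named individuals",
            "read_protection": "encrypted and audited"}
```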
- the release decision by the release verification device 3 is thus dependent on the actual applicability of the privacy policy 15 and the additional measures 22 taken with the goal of achieving the specifications of the data collection request 6.
- This risk assessment was started based on a use case for collecting data not only from test drivers, but also from normal customers. The use case is motivated by the fact that the amount of necessary test kilometres for autonomous driving development can no longer be achieved efficiently with test vehicles/drivers only. Dealing with customer data for research and development purposes brings new challenges for complying with legal regulations and protecting individual rights and privacy. Data protection regulations like the GDPR allow the processing of personal data only for a limited set of reasons, which have to be declared beforehand.
- profiling of the data subject is the main motivation for collecting data, for example to personalize services requested by the subject itself.
- services like personalized navigation based on individual interests, vehicle updates and upgrades based on driving behavior, and predictive maintenance are becoming more and more common.
- Legally this type of data collection is mostly arranged between the data controller/processor and the subject by consent or contract.
- Data protection focuses on the reduction of personal related data to the minimum necessary.
- Data driven development wants to benefit from a huge amount of data to learn more about the unknown.
- an a priori risk assessment to be able to estimate the degree of "personality" in the requested data as an input for a data protection impact assessment (Datenschutz-Folgenabschätzung).
- an a posteriori risk assessment to prove whether the considered measures give the expected results with real data, in order to approve the final access to the data.
- the process starts with an entity requesting a data collection for a data analytics purpose (P1). The entity directly describes its utility policy by defining the mandatory data, the data accuracy, and the mandatory user roles. Accuracy may also be named quality. This also fits well with the concept of data minimization and the need-to-know principle required by the GDPR.
- an a priori risk estimation P2 of the expected data can be performed. This risk, as well as the technical and organizational measures (TOMs) from the IT security department, will be part of the entry in the register of processing activities. If the a priori condition is met, the data collection job can start. While data is collected in the vehicle (P3), first measures S1 defined in the privacy policy can be performed to reduce data and personal information.
- the provision of the observation data by the data supply units cannot be considered anonymous, since the observation data have to be transmitted to the server.
- the transmission at least requires an individual IP address of the respective data supply unit.
- individuality can be removed by techniques using a relay device, third-party escrow (trusted entity) or TOR-like systems to obfuscate/remove traceability to the data source S2.
- real anonymization measures S3 and metrics to calculate the personal information in the total data set can be performed by the server.
- the a posteriori risk assessment P6 can now compare the real data, after applying the privacy protection measures S1, S2, S3, with the a priori estimated risk and, if necessary, adjust the utility policy (if possible) or the TOMs. Continuous monitoring of the privacy policy at a surveillance device as well as of the TOMs at the IT security department ensures protection against future privacy violations.
- date and time can look like: DD/MM/YYYY HH:MM:SS
- generalizations of date and time can look like: DD/MM/YYYY + HH:MM:SS; morning, noon, evening, night; sunrise, sunset; etc. Then we need to specify which generalization can be applied per data format, also defined as a so-called "Domain Generalization Hierarchy" (DGH).
- DGH: Domain Generalization Hierarchy
- Hierarchies for time can look like this: HH:MM:SS; 00:00 - 00:10, 00:10 - 00:20, ...; 00:00 - 00:30, 00:30 - 01:00, ...; 00:00 - 01:00, 01:00 - 02:00, ...; 00:00 - 02:00, 02:00 - 04:00, ...; etc.
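The time hierarchy above can be applied mechanically by bucketing a timestamp at a chosen generalization level; the level numbering below is an assumption made for illustration:

```python
def generalize_time(hh, mm, level):
    # generalize a time of day along the domain generalization hierarchy:
    # level 0 -> 10-minute, 1 -> 30-minute, 2 -> 1-hour, 3 -> 2-hour intervals
    width = {0: 10, 1: 30, 2: 60, 3: 120}[level]      # minutes per interval
    start = ((hh * 60 + mm) // width) * width
    fmt = lambda t: f"{t // 60:02d}:{t % 60:02d}"
    return f"{fmt(start)} - {fmt(start + width)}"
```

For example, the exact time 00:15:00 generalizes to the interval 00:10 - 00:20 at level 0 and to 00:00 - 02:00 at level 3.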
- Direct identifiers can be: name, address, VIN, IP address, etc.
- Quasi-identifiers do not directly identify a person, but can be used in combination with other quasi-identifiers or additional (external) knowledge (e.g. social media) to identify a subject, such as GPS (e.g. home location), time (working hours, hours out of house, etc.), velocity (fast or slow driver, using highways or only urban streets, etc.), and many others.
- Non-identifiers are attributes with a low probability of revealing or gaining any useful personal information. Examples are debug information, the number of trigger occurrences, etc. Knowing this, removing identifiers from a record does not necessarily mean that the record is anonymous. We suggest an estimation of the criticality using specific attributes or combinations as follows: we take the individual attribute (identifier, quasi-identifier, non-identifier) and multiply it with the corresponding hierarchy weight.
- RG is the group criticality, giving the degree of personal reference of a group.
- the groups may comprise non-identifiers, quasi-identifiers and direct identifiers.
- Wj may be a weight value of an observation data element of a specific accuracy.
- non-identifiers shall converge to 1: regardless of how many non-identifiers we have, they will not exceed a criticality higher than four quasi-identifiers (equal to a criticality around 1);
- quasi-identifiers shall converge to 2.5: regardless of how many quasi-identifiers we have, they will not exceed a criticality higher than 2.5 (four quasi-identifiers have a value comparable to one identifier);
- direct identifiers shall converge to 5.
- the total criticality is the sum of all attributes weighted with their respective function.
- the partial equations are only valid if there is at least one attribute in the respective category. The sum can reach a maximum of 8.5.
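The description gives the limits (1 for non-identifiers, 2.5 for quasi-identifiers, 5 for direct identifiers) and the bound of 8.5, but not the exact partial equations; the sketch below therefore assumes a saturating exponential as one possible function with these convergence properties:

```python
import math

# saturation limits per attribute group as stated above
LIMITS = {"non_identifier": 1.0, "quasi_identifier": 2.5, "identifier": 5.0}

def group_criticality(weights, limit):
    # R_G: grows with the summed hierarchy weights of the group's
    # attributes but converges to the group's limit
    if not weights:   # partial equation only valid with >= 1 attribute
        return 0.0
    return limit * (1.0 - math.exp(-sum(weights) / limit))

def total_criticality(groups):
    # sum of all group criticalities; bounded above by 1 + 2.5 + 5 = 8.5
    return sum(group_criticality(w, LIMITS[g]) for g, w in groups.items())
```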
- the example shows how a risk handling process for personal data originated in one source and being transmitted to a data sink can be provided.
Abstract
The invention is concerned with a method for verifying an approval of a data set (9) of a data collection job performed by a data collection system (1), wherein at least one observation data element (8) of the data set (9) is provided by data supply units (4) of the data collection system while the data supply units (4) are operated by users and is collected in a data set (9) by a server (5) of the data collection system (1), the method comprising a release of the data collection request (6) for the data collection system by the release verification device (3) in case a degree of the personal reference of the data set (9) fulfills a predefined a priori condition (23), wherein the a priori condition (23) is related to a reduced risk of assignability of one of the data supply units (4) to a particular user based on the data set (9).
Description
Method for verifying an approval of a data set of a data collection job performed by a data collection system, data collection system, data supply unit, server and release verification device
DESCRIPTION:
The invention is concerned with a method for verifying an approval of a data set of a data collection job performed by a data collection system, a data collection system, a data supply unit, a server and a release verification device.
In order to improve devices of vehicles it is necessary to collect observation data provided by vehicles in the field while the vehicles are operated by drivers. An analysis of the observation data collected by the vehicles allows an identification of errors and deficiencies in current devices of the vehicles. To obtain observation data for analysis, it is common to record data of vehicles operated by test drivers. A deeper analysis needs a certain amount of the observation data that cannot be provided by test drivers only. It is therefore necessary to include other sources of observation data in the data collection.
The connection of vehicles via mobile internet allows the gathering of the observation data provided by vehicles of customers to increase the amount of collected observation data related to several driving conditions. The involvement of observation data of customers results in the risk of an identification of single drivers providing observation data to the system. An identification of a single driver is an intrusion into the privacy of the respective driver because it allows a tracking of their habits. It is therefore necessary to perform a proper risk assessment in order to determine the risk of an identification of a single user. The risk of an identification depends on several factors like the number of participants providing observation data, the accuracy of the data and the
amount of the data. Unfortunately, it is not possible to determine the risk of an identification before the observation data are collected by a data collection system.
US 2016/0164922 A1 describes a computer-implemented method for managing an authentication policy for a user on a network. An authentication policy management system assesses individual user attributes and generates a risk value for each of the attributes for a user.
US 2020/0042723 A1 describes an identity fraud risk assessment platform. The risk assessment platform determines a level of risk of identity fraud associated with a user based on a first and a second user and device attributes. The risk assessment platform grants or denies the user access to a second protected resource based on the determined level of risk of identity fraud associated with the user.
US 2021/ 0089680 A1 describes a policy driven data movement for moving personal and sensitive data from a source filesystem to a destination filesystem while enforcing a source privacy legal framework.
It is an objective of the present invention to provide a method to assess the personal reference of a data set comprising data provided by users in order to avoid an identification of a single user.
The objective is accomplished by the subject matter of the independent claims. Advantageous developments with convenient and non-trivial further embodiments of the invention are specified in the following description, the dependent claims and the figure.
The invention comprises a method for verifying an approval of a data set of a data collection job performed by a data collection system. At least one observation data element of the data set is provided by data supply units of the data collection system while the data supply units are operated by users. The at least one observation data element of the data set is collected in a data
set by a server of the data collection system. In other words, the method is related to a collection of the data set, wherein the data set comprises at least one observation data element. The at least one observation data element is provided by the data supply units of the data collection system. The at least one observation data element is added to the data set by the server of the data collection system. As the collection of the at least one observation data element may violate a privacy of the users it is necessary to prove the data collection job before it is performed.
A first step of the method comprises a reception of a data collection request by a release verification device. The data collection request comprises a utility policy defining the at least one observation data element of the data set and the predefined accuracy of the at least one observation data element of the data set to be collected by the data collection system. In other words, the release verification device receives the data collection request, which requests the performance of the data collection job by the data collection system. In order to define the at least one observation data element of the data set, that has to be collected during the data collection job and the accuracy of the at least one observation data element, the data collection request comprises the utility policy. The release verification device may be designed as a computer configured to determine whether the data collection job requested in the data collection request is to be approved for the data collection system. The data collection request may be a message sent by a data analysis device computer to the release verification device. The utility policy may be designed as a configuration file listing the one or more observation data to be collected by the data supply units and respective qualities that may define an accuracy of the respective observation data.
A second step of the method comprises a calculation of a degree of a personal reference of the data set to be collected by the data collection system according to a predefined a priori estimation method by the release verification device. In other words, the release verification device calculates the degree of the personal reference of the data set that has to be provided by the
data collection system. The calculation is performed according to the predefined a priori estimation method. The estimation may be based on a predicted data set. The degree of a personal reference may describe the risk of identification of a single user or the risk that a specific user may be assigned to a specific data supply unit based on the at least one observation data element of the data set.
A third step of the method comprises a release of the data collection request for the data collection system by the release verification device in case the degree of the personal reference of the data set to be collected fulfills a predefined a priori condition. In other words, the data collection request is forwarded to the data collection system by the release verification device so that the data collection job may be performed by the data collection system according to the data collection request. The release will be performed if the degree of the personal reference of the data set to be collected, that was calculated according to the predefined a priori estimation method, fulfills the predefined a priori condition. The a priori condition is related to a reduced risk of assignability of one of the data supply units to a particular user based on the data set to be collected. The reduced risk may be predefined and evaluated using privacy models of the state of the art. In other words, it is necessary that the personal reference satisfies the a priori condition in order to be executed by the data collection system. The a priori condition is designed to limit the risk that one of the data supply units may be assigned to one of the users by analyzing the data set.
A fourth step comprises an execution of the data collection job by the data collection system according to the utility policy of the data collection request and a transmission of the collected data set to the release verification device. In other words, the data collection system performs the data collection job to provide the data set. The data collection job is performed according to the utility policy of the data collection request in order to provide the data set comprising the at least one observation data element of the predefined accuracy. The execution of the data collection job may comprise a measurement of the at least one observation data element of the predefined accuracy by
the data supply units and the transmission of the at least one observation data element to a server of the data collection system by the data supply units, wherein the server creates the collected data set by combining the observation data provided by the data supply units. The server transmits the data set to the release verification device.
A fifth step comprises a calculation of a degree of a personal reference of the data set collected by the data collection system according to a predefined a posteriori estimation method by the release verification device. In other words, the release verification device calculates the degree of the personal reference of the data set that was created during the data collection job. The personal reference is calculated according to the predefined a posteriori estimation method that may be different from the a priori estimation method. In other words, the personal reference is calculated twice. A first calculation is made before the data collection job is performed according to the a priori estimation method. The first calculation is based on expected observation data of the data set. The second calculation is made after the performance of the data collection job, wherein the second calculation is performed according to the a posteriori estimation method. The second calculation is based on the data set that was collected during the data collection job.
In case the degree of the personal reference of the data set collected by the data collection system fulfills a predefined a posteriori condition, the data set collected by the data collection system is released by the release verification device. The a posteriori condition is related to a reduced risk of assignability of one of the data supply units to a particular user based on the data set collected by the data collection system.
The invention has the advantage that privacy protection is considered in two verification steps. The first verification step is performed before the execution of the data collection job - based on an expected data base (synthetic, simulated, or real data from a different job) and the required quality defined in the utility policy - in order to estimate, whether it has to be expected that the resulting real data set needs to be considered as personal related or not, resulting in adapted protection measures.
The second verification step is performed after the execution of the data collection job. In the second verification, it may be verified whether the privacy policy is fulfilled with the required quality measures defined in the utility policy applied on the real data.
The risk of identification is related to a predefined technically reasonable effort, which is specified in the privacy policy. The reasonable effort is dependent on implemented technical and organizational security measures as well as applied anonymization techniques such as k-anonymity.
The invention also comprises embodiments that provide features which afford additional technical advantages.
According to a further embodiment of the invention, the degree of the personal reference is based on k-anonymity, l-diversity, t-closeness or differential privacy. In other words, the degree of the personal reference is given in k-anonymity, l-diversity, t-closeness or differential privacy. Details about the aforementioned procedures can be found in Domingo-Ferrer, Josep, and Jordi Soria-Comas. "From t-closeness to differential privacy and vice versa in data anonymization." Knowledge-Based Systems 74 (2015): 151-158. K-anonymity, l-diversity, t-closeness and differential privacy are examples of metrics to determine the degree of the personal reference. It is also possible to use other metrics known from the prior art. The embodiment has the advantage that the degree of the personal reference is given as a quantity or property which is common or widespread in the state of the art.
According to a further embodiment of the invention, the degree of the personal reference of the data set to be collected is calculated by the release verification device using the predefined a priori estimation method comprising the step of reading out a weight value assigned to the at least one observation data element of the data set of the respective accuracy of the at least one observation data element from a database, wherein the weight value describes the specific degree of the personal reference of the at least one observation data element. In other words, the release verification device comprises a database, wherein the database assigns a weight value to each observation data element of each accuracy. The weight value is related to the personal reference of the single observation data element of the respective accuracy alone. The release verification device may read out every weight value saved in the database for the observation data to be provided in the data set. In a next step, the release verification device performs a combination of the weight values of the at least one observation data element of the data set to be collected according to a predefined combination method to get the degree of the personal reference of the data set to be collected. In other words, the next step is related to the combination of the weight values in order to get the personal reference of the data set comprising the observation data. The combination is performed by the release verification device according to the predefined combination method. The predefined combination method may determine the combination of the single weight values respecting interferences between the observation data.
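The read-out and combination of weight values described above can be sketched as follows; the database contents and the plain-sum combination method are assumptions, since the embodiment leaves both open:

```python
# hypothetical weight database: (observation data element, accuracy) -> weight
WEIGHT_DB = {
    ("gps_position", "exact"):  2.0,
    ("gps_position", "coarse"): 0.8,
    ("velocity",     "exact"):  0.6,
}

def a_priori_degree(requested, combine=sum):
    # read out the weight value of each requested observation data
    # element at its requested accuracy and combine the weights
    # according to the predefined combination method
    return combine(WEIGHT_DB[(element, accuracy)]
                   for element, accuracy in requested)
```

A combination method that models interferences between observation data could replace the plain sum, for example by passing a different combine function.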
According to a further embodiment of the invention, the execution of the data collection job by the collection system comprises predefined anonymization steps defined in a privacy policy saved in the data collection system. Anonymization can be achieved by data manipulation such as generalization, synthetization, noise or other known methods. In other words, the data collection system performs predefined anonymization steps during the data collection job in order to anonymize the data set. Therefore, it is possible to reduce the risk of assignability of the data set. The predefined anonymization steps and the parameters used in the predefined anonymization steps are defined in the privacy policy of the data collection system. The privacy policy may be a configuration file that may be saved on a server of the data collection system and/or the data collection units.
According to a further embodiment of the invention, the predefined anonymization steps comprise a generalization of the at least one observation data
element by the data supply units after measurement of the at least one observation data element to achieve the predefined accuracy of the at least one observation data element according to the privacy policy. In other words, it is possible that the at least one observation data element may be measured by the data supply units in an accuracy that is higher than the accuracy requested according to the utility policy. The data supply units may reduce the accuracy of the at least one observation data element by anonymization and/or generalization data in order to limit the risk of assignability.
According to a further embodiment of the invention, the predefined anonymization steps comprise a relaying of the at least one observation data element from the data supply units to the server via a relay device of the data collection system, wherein transmission data and/or identification data of the data supply units provided by the supply units are removed by the relay device before the at least one observation data element is forwarded to the server. In other words, the observation data provided by the data supply units is not sent to the server directly but via the relay device. The observation data itself may not comprise data suitable for an identification of a single data supply unit. However it may be possible that the transmission data of the transmission of the observation data are provided to the server by the data supply units. These transmission data may allow a link of the observation data to the data supply unit, providing the observation data. The transmission data may comprise an IP-address or other identification data of the data supply units. Therefore, it is intended to transmit the observation data via the relay device in order to hide the respective data supply units. The relay device may be a computer or an overlay network like TOR that may remove transmission data and/or identification data of the data supply units. The relay device sends the observation data to the server without the transmission data and/or identification data provided by the data supply units. The server may receive transmission data and/or identification data of the relay device, but not the transmission data and/or identification data of the data supply units. This embodiment has the advantage, that it is not possible to link the observation data to a single data supply unit by means of transmission data and/or identification data of the transmission.
According to a further embodiment of the invention, the predefined anonymization steps comprise an anonymization and/or reduction of the accuracy of the at least one observation data element by the server of the data collection system according to the privacy policy. In other words, the observation data are anonymized or reduced by the server according to the privacy policy. The embodiment has the advantage, that a centralized anonymization or generalization of the data set may be performed by the server after the reception of the observation data of the data supply units.
According to a further embodiment of the invention, the data collection request comprises an auxiliary utility policy defining the at least one observation data element of the data set and an auxiliary accuracy of the at least one observation data element of the data set to be collected by the data collection system. The auxiliary utility policy may be used by the collection system in case the degree of the personal reference of the data set collected by the data collection system according to the utility policy does not fulfill the predefined a posteriori condition. In other words, the data collection request comprises the utility policy and the auxiliary utility policy. The utility policy may define the at least one observation data element of the data set and the accuracy of the at least one observation data element of the data set that has to be provided by the data collection job. The utility policy may define a preferred configuration of the data set. However, in case that the utility policy leads to the data set having a degree of personal reference that does not comply with the conditions and/or the privacy policy, the auxiliary utility policy of the collection request may be used instead. The auxiliary utility policy describes the details of the data collection job that may provide a data set that may comply with the conditions and/or the privacy policy. A calculation of a degree of a personal reference of the data set according to the auxiliary utility policy collected by the data collection system according to the predefined a posteriori estimation method is performed by the release verification device. The release of the data set collected by the data collection system according to the auxiliary utility policy by the release verification device may be performed in case the degree of the personal reference of the data set collected
by the data collection system according to the auxiliary utility policy fulfills the predefined a posteriori condition.
According to a further embodiment of the invention, the method comprises a step of comparing the degree of the personal reference of the data set to be collected by the data collection system according to the predefined a priori estimation method with the degree of the personal reference of the data set collected by the data collection system according to the predefined a posteriori estimation method by the release verification device. A further step comprises an adaption of measures defining the organizational and technical protection of the data sets as a whole according to a predefined adaption method. In other words, the release verification device compares the personal reference calculated before the data collection job with the personal reference calculated after the data collection job. The release verification device adapts the measures according to the predefined adaption method depending on the difference between the personal reference calculated before the data collection job and the personal reference calculated after the data collection job. In other words, the release verification device may change the measures defining the organizational and technical protection steps of the data sets as a whole according to the predefined adaption method depending on the difference between the personal references. This embodiment has the advantage that the release verification device may evaluate the measures in order to reduce the difference between the personal references calculated before and after the data collection job. Therefore, the personal reference calculated according to the a priori estimation method may be more reliable in further analysis.
According to a further embodiment of the invention, the method comprises an adaption of the predefined a priori estimation method by the release verification device depending on the difference between the personal reference calculated before the data collection and the personal reference calculated after the data collection. In other words, the release verification device may fit the parameters used in the predefined a priori estimation method in order to reduce the difference between the personal references.
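The adaption of the a priori estimation method could, for instance, be sketched as a proportional correction of a single scaling factor. The function name, the one-parameter model and the learning rate are illustrative assumptions; the text leaves the concrete adaption method open.

```python
def adapt_a_priori_scale(scale: float, prior_ref: float, posterior_ref: float,
                         learning_rate: float = 0.5) -> float:
    """Nudge the scaling factor of the a priori estimator toward the
    a posteriori result, reducing the gap for future data collection jobs."""
    if prior_ref == 0:
        return scale
    # Ratio by which the a priori estimate over- or undershot the real value.
    correction = posterior_ref / prior_ref
    # Move only part of the way toward the correction to keep adaptation stable.
    return scale * (1.0 - learning_rate + learning_rate * correction)
```

For example, if the a priori estimate was 2.0 but the a posteriori value was 1.0, the scale is reduced from 1.0 to 0.75, so the next a priori estimate for a similar job lands closer to reality.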
The invention comprises a data collection system configured to execute a data collection job to collect at least one observation data element, of a predefined accuracy, of a data set to be collected according to a utility policy of a data collection request, and to transmit the collected data set to a release verification device. The data collection system comprises data supply units configured to provide the at least one observation data element of the data set while the data supply units are operated by users and a server configured to collect the at least one observation data element in the data set.
The invention comprises a data supply unit of a data collection system, configured to provide at least one observation data element, of a predefined accuracy, of a data set to be collected according to a utility policy of a data collection request while the data supply unit is operated by a user.
The invention comprises a server of a data collection system, configured to collect at least one observation data element in a data set.
The invention comprises a release verification device. The release verification device is configured to receive a data collection request, wherein the data collection request comprises a utility policy defining the at least one observation data element of the data set and a predefined accuracy of the at least one observation data element of the data set to be collected by the data collection system. The release verification device is configured to calculate a degree of a personal reference of the data set to be collected by the data collection system according to a predefined a priori estimation method. The release verification device is configured to release the data collection request for the data collection system in case the degree of the personal reference of the data set to be collected fulfills a predefined a priori condition, wherein the a priori condition is related to a reduced risk of assignability of one of the data supply units of the data collection system to a particular user based on the data set to be collected. The release verification device is configured to receive the collected data set from the data collection system and to calculate a degree of a personal reference of the data set collected by the data collection system according to a predefined a posteriori estimation method.
The release verification device is configured to release the data set collected by the data collection system, in case the degree of the personal reference of the data set collected by the data collection system fulfills a predefined a posteriori condition, wherein the a posteriori condition is related to a reduced risk of assignability of one of the data supply units to a particular user based on the data set collected by the data collection system.
The invention also comprises embodiments of the inventive data collection system, the inventive data supply unit, the inventive server, the inventive release verification device that comprise features that correspond to features as they have already been described in connection with the embodiments of the inventive method. For this reason, the corresponding features of the embodiments of the inventive data collection system, the inventive data supply unit, the inventive server, the inventive release verification device are not described here again.
The inventive data collection system, the inventive data supply unit, the inventive server, the inventive release verification device may comprise a data processing device or a processor circuit adapted to perform an embodiment of the method according to the invention. For this purpose, the processor circuit may comprise at least one microprocessor and/or at least one microcontroller and/or at least one FPGA (field programmable gate array) and/or at least one DSP (digital signal processor). Furthermore, the processor circuit may comprise program code which comprises computer-readable instructions to perform the embodiment of the method according to the invention when executed by the processor device. The program code may be stored in a data memory of the processor device.
The inventive data supply unit may be designed as a device for a vehicle that is preferably designed as a motor vehicle, in particular as a passenger vehicle or a truck, or as a bus or a motorcycle.
The invention also comprises the combinations of the features of the different embodiments.
The embodiment explained in the following is a preferred embodiment of the invention. However, in the embodiment, the described components of the embodiment each represent individual features of the invention which are to be considered independently of each other and which each develop the invention also independently of each other and thereby are also to be regarded as a component of the invention in individual manner or in another than the shown combination. Furthermore, the described embodiment can also be supplemented by further features of the invention already described.
In the figures identical reference signs indicate elements that provide the same function.
In the following an exemplary implementation of the invention is described. The only figure shows:
Fig. a schematic illustration of a method for verifying an approval of a data set of a data collection job.
The only figure Fig. shows a schematic illustration of a method for verifying an approval of a data set of a data collection job.
A first step P1 may comprise a reception of a data collection request 6 by a release verification device 3. The data collection request 6 may be sent by a data processing device 2 in order to initiate a data collection job to be performed by the data collection system 1. The data collection request 6 may depend on a use case and may define data to be collected by the data collection system 1 for the data processing device 2. The data collection system 1 may comprise data supply units 4 that may be arranged in vehicles and a server 5. The data collection request 6 may comprise a utility policy 7 defining a data set 9 to be provided by the data collection system 1 during the data collection job. The utility policy 7 may define observation data 8 of the data set 9 and an accuracy of each of the observation data 8. The utility policy 7 may define a preferred data set 9. The data collection request 6 may also
comprise an auxiliary utility policy 11 describing an auxiliary data set 13, wherein the auxiliary data set 13 comprises auxiliary observation data elements 12 of an auxiliary accuracy 14. The auxiliary utility policy 11 may define an auxiliary data set 13 that may be provided by the data collection system 1 in case the data set 9 according to the utility policy 7 is not approved by the release verification device 3.
After reception of the data collection request 6 by the release verification device 3, the release verification device 3 may perform a calculation of a degree of personal reference 17 of the data set 9 to be collected by the data collection system 1. The calculation may be performed according to a predefined a priori estimation method P2. The a priori estimation method may utilize a database 20, wherein the database 20 comprises weight values 21 that are assigned to the observation data 8 of the data set 9 of the respective accuracies 10. The weight values 21 may describe a degree of personal reference of the single observation data 8 of the respective accuracy. In a next step, the release verification device 3 may combine the weight values 21 of the at least one observation data element 8 of the data set 9 according to a predefined combination method to get the degree of the personal reference of the data set 9 to be collected. The personal reference of the data set 9 may define a risk to identify a single user of the data collection system 1 based on the data set 9. In order to perform the data collection job defined in the data collection request 6, the degree of the personal reference of the data set 9 to be collected has to fulfil a predefined a priori condition 23. The predefined a priori condition 23 may define a maximum value of the degree of the personal reference of the data set 9. If the degree of personal reference 17, 19 does not fulfil the predefined a priori condition 23, the release verification device 3 may calculate the personal reference of the data set 9 to be collected by the data collection system 1 according to the auxiliary utility policy 11. For this calculation, the release verification device 3 may also use the predefined a priori estimation method. If the degree of personal reference 17 calculated for the auxiliary utility policy 11 satisfies the a priori condition 23, the data collection job will be performed according to the auxiliary utility policy 11.
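The a priori estimation described above can be sketched as follows. The weight database, the attribute/accuracy pairs and the plain-sum combination method are illustrative assumptions; the text leaves the concrete combination method and weight values open.

```python
# Hypothetical weight database 20: (attribute, accuracy level) -> weight value 21.
WEIGHTS = {
    ("location", "gps_exact"): 1.0,
    ("location", "zip_code"): 0.4,
    ("time", "hh_mm_ss"): 0.6,
    ("time", "day_night"): 0.1,
}

def a_priori_personal_reference(requested, weights=WEIGHTS):
    """Combine the weight values of the requested observation data elements.
    A plain sum stands in for the predefined combination method."""
    return sum(weights[(attr, acc)] for attr, acc in requested)

def release_request(requested, max_reference=1.0):
    """A priori condition 23: release the job only if the combined degree
    of personal reference stays below a maximum value."""
    return a_priori_personal_reference(requested) <= max_reference
```

A coarse request (ZIP code plus day/night) passes the illustrative threshold, while an exact GPS position plus exact time exceeds it, so the auxiliary utility policy would be evaluated instead.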
In case the degree of personal reference 17, 19 of the data collection job according to the utility policy 7 fulfils the predefined a priori condition 23, the data collection job will be performed according to the utility policy 7 by the data collection system 1. The data collection job may comprise a collection of the at least one observation data element 8 of the data set 9 by the data supply units 4 of the data collection system 1 in a step P3. In order to ensure privacy, the data collection system 1 may comprise a privacy policy 15. The privacy policy 15 may define several anonymization steps S1, S2, S3 to anonymize the data set 9. The anonymization steps S1, S2, S3 may define a manipulation of the data and the data set 9 to reduce the degree of personal reference 17, 19. The privacy policy 15 can be a predefined set of rules for the system. For example, it may have been defined by a data protection officer to ensure compliance with data protection standards. The privacy policy 15 may contain specifications on the measures that can be used, such as generalization, or on the metric to be used to measure the reference to a person. The privacy policy 15 may not only specify which anonymization steps S1, S2, S3 are used, but can also provide definitions and guidelines that may specify under which conditions the data set 9 is considered "anonymous" or "secure enough". For example, the privacy policy 15 may specify that the data set 9 is considered anonymous when a k-anonymity of 5 is achieved. The privacy policy 15 may comprise a set of rules for the release verification device 3. In other words, the release verification device may use the specifications of the privacy policy 15 to verify the data collection request 6 and to decide whether the data set 9 can be released directly or whether additional measures like the anonymization steps S1, S2, S3 are required.
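The k-anonymity condition named in the privacy policy example can be checked as in the following sketch; the record layout and quasi-identifier names are illustrative assumptions. A data set is k-anonymous when every combination of quasi-identifier values occurs in at least k records, so no data supply unit can be singled out within its group.

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k=5):
    """Check the privacy-policy condition that every combination of
    quasi-identifier values occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())
```

Five records sharing the generalized values ("85***", "morning") form a group of size 5 and pass with k = 5; adding a single record with a unique combination creates a group of size 1 and the check fails, triggering further anonymization steps.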
The anonymization steps may comprise a step S1 of generalization or anonymization of the observation data 8 collected by the data supply units 4 performed by each of the data supply units 4 to reduce the accuracy of the at least one observation data element 8.
In order to create the data set 9, the data supply units 4 may transmit the observation data 8 to a server 5 of the data collection system 1 in a step P4. In order to ensure privacy, the step of transmission may comprise a transmission of the observation data 8 to a relay device 16 of the data collection system 1 in a step S2. The relay device 16 may transmit the observation data 8 to the server 5. Therefore, no direct connection between the data supply units 4 and the server 5 is established. Therefore, it is not possible to assign the observation data 8 to a single data supply unit 4 by analysing transmission or identification data provided during the transmission.
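The relay step S2 can be sketched as follows; the message layout ("src", "payload") is an illustrative assumption. The relay drops transport metadata and shuffles the batch so that neither the source address nor the arrival order links an observation to a particular data supply unit.

```python
import random

def relay_forward(batch):
    """Relay device sketch: keep only the observation payload, drop the
    source address, and shuffle the batch to remove ordering information."""
    payloads = [{"payload": msg["payload"]} for msg in batch]
    random.shuffle(payloads)
    return payloads
```

After relaying, the server receives the observation data but no identifier from which the originating unit could be reconstructed.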
In a step P5 the server 5 may receive the observation data 8 and may add them to the data set 9. To ensure privacy, the server 5 may perform generalization or anonymization steps S3 to reduce the accuracy of the observation data 8 in the data set 9.
As the data set 9 collected by the data collection system 1 may be different from an expected data set 9, the data set 9 may be provided to the release verification device 3 in a step P6 in order to calculate the degree of personal reference 17, 19 of the data set 9 collected by the data collection system 1 according to a predefined a posteriori estimation method. The a posteriori estimation method may be different from the a priori estimation method used before performing the job. In order to provide the data set 9 to the data processing device 2 that has requested the data collection job, the degree of personal reference 17, 19 of the data set 9 must fulfil the a posteriori condition 18.
In case the a posteriori condition 18 is fulfilled, the data set 9 may be sent to the data processing device 2, which may analyse the data set 9 in a step P7. In case the personal reference of the data set 9 collected does not fulfil the a posteriori condition 18, data protection measures 22 for protecting the data set 9 may be tightened by the data collection system 1 in order to reduce the degree of personal reference 19 of the data set 9 or to limit access to the data set 9. As an example, the server 5 of the data collection system 1 may repeat the generalization and/or anonymization step S3 in order to reduce the accuracy 10 of the observation data 8 when the privacy policy 15 is changed, wherein the accuracy 10 is set to a lower value.
It may happen that the release verification device 3 determines that the a posteriori condition 18 or the requirements of the privacy policy 15 are not met by the data set 9 when performing steps S1-S3. In this case, measures 22 of a data protection device for the organizational and technical protection of data records as a whole, for example authorized access, encryption, or employee sensitization, may be enforced more stringently in the system 1. This may include, for example, a stronger restriction of access to the data set 9 to smaller groups of people and a stronger read protection of the data set 9. As a further measure, a collection of a larger volume of data by the data collection system 1 may be arranged, which may also be processed in steps S1-S3. One option may provide that the data set is discarded if, for example, it is foreseeable that compliance with specifications regarding the reference to persons cannot be maintained.
Steps S1-S3 may be performed according to the privacy policy 15. The application of steps S1-S3 may lead to the fulfillment of requirements for a personal reference. However, it may happen that the steps S1-S3 are not sufficient to fulfill the requirements. In this case, the release verification device 3 may wait for more data, adjust the measures 22, or refuse to release the data set 9.
The measures 22 may be chosen such that the protection depends on the degree to which the data set is personal. If the data set has no or a predetermined low degree of personal reference, fewer restrictions on access to the data set may be imposed by the measures. If the data set has a predetermined higher degree of personal reference, stronger restrictions regarding data access to the data set may be specified by the measures. This may include, for example, that fewer persons are authorized to access the data set and a stricter access protection to the data set is specified.
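The graded measures can be sketched as a mapping from the degree of personal reference to access restrictions; the thresholds, role names and measure fields below are illustrative assumptions, not values from the text.

```python
def access_measures(personal_reference: float):
    """Map the degree of personal reference of a data set to graded
    protection measures 22: the higher the reference, the fewer
    authorized roles and the stronger the protection."""
    if personal_reference < 0.5:
        return {"authorized_roles": ["developer", "analyst"], "encryption": "standard"}
    if personal_reference < 2.5:
        return {"authorized_roles": ["analyst"], "encryption": "standard"}
    return {"authorized_roles": ["data_protection_officer"], "encryption": "strong"}
```

A data set with a very low degree of personal reference stays broadly accessible, while a highly personal data set is restricted to a single role with strong encryption.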
The release decision by the release verification device 3 is thus dependent on the actual applicability of the privacy policy 15 and the additional measures 22 taken with the goal of achieving the specifications of the data collection request 6.
This risk assessment was started based on a use case for collecting data not only from test drivers, but also from normal customers. The use case is motivated by the fact that the amount of necessary test kilometres for autonomous driving development cannot be achieved efficiently with test vehicles/drivers only anymore. Dealing with customer data for research and development purposes brings new challenges for being compliant with legal regulations and protecting individual rights and privacy. Data protection regulations like the GDPR allow us to process personal data only for a limited number of reasons, which we have to declare beforehand. As part of the process to declare the reasoning, we need to make risk assessments to compare the benefit of the data processing versus the risk for the individual in case of a data breach or a misuse of the data. With data minimization in mind and the opportunity to protect personal information by data manipulation or anonymization technologies, the estimation of the risk before the data collection has even started causes a dilemma: we only know whether the data manipulation - as a privacy protection measure - is sufficient once we have performed the measures on the real data. For a proper risk assessment we need to make an analysis of the data prior to and after the data collection, as we will find out later. We will need to consider a Utility Policy 7, defining which degree of data accuracy is necessary, and a Privacy Policy 15, defining which protection measures need to be implemented. First we want to list some challenges we have to confront when it comes to risk assessment of personal data.
Legislation makes it mandatory to handle risks for individuals when dealing with their data. Rating the risk of personal data can be quite challenging.
When thinking about the development of embedded vehicle systems, the focus of developers is on enhancing functionality and making driving safer and more comfortable. A profiling of an individual is generally not of any interest for a developer. Anonymization of data for development purposes is therefore a valid option to protect the privacy of individuals (data subjects) while retaining the necessary information. By choosing only data necessary for the specific development purpose, the developer can also follow the privacy principle of data minimization and save data bandwidth, which is a limiting factor for the automotive industry today. Nevertheless, some use cases might not know the necessary data beforehand or need raw data like video or precise GPS locations to be able to re-simulate the environment of the recorded event, adjust sensor parameters, identify objects, or understand functional misbehavior of the system. On the other side there are use cases where profiling of the data subject is the main motivation for collecting data, for example to personalize services requested by the subject itself. With the growing integration of information systems in the vehicle, services like personalized navigation based on individual interests, vehicle updates and upgrades based on driving behavior, and predictive maintenance are becoming more and more common. Legally this type of data collection is mostly arranged between the data controller/processor and the subject by consent or contract. To the list of data collection purposes we also have to add legally mandatory data like safety-critical information including e-call, or monitoring systems like intrusion detection to provide security of the vehicle and fleet, just to give some examples.
The estimation of the risk of processing all different types of data for diverse use cases is a challenging task, specifically when processes need to be automated. Probably the most challenging factor is to find a clear definition of which data is personal and where the subject's individual tolerance is. Every individual human being has their own personal limits about sharing information, which makes it hard to satisfy everybody. This makes it mandatory to define an assessment strategy which is adjustable based on subjects' feedback, changes of regulations, new attacks, data leaks, etc. For a privacy risk assessment, not only a single record (one snapshot of a situation containing different attributes) has to be rated, but also combinations of records of a single individual. When thinking of anonymization, de-identification of records (removing direct identifiers like VINs, IP-addresses, names, etc.) from a single record is a first step. While a de-identified, singled-out record itself might not reveal sensitive information about an individual, the possibility remains that the simple knowledge of the existence of a subject inside a record database - together with any possible external information, e.g. social media - might help reveal sensitive information. Therefore anonymization measures mainly measure their efficiency based on how well an individual can be hidden in a set of data from many individuals; in other words, the bigger the group of records with the same - maybe generalized - information, the less probable it is to single out one individual. This means that, to be able to anonymize data, a specific amount of data - ideally from many different subjects - is mandatory. The amount of expected data also highly depends on how specific the situation of the developer's interest is and how likely it is to happen. While there are cases where we would like to know statistics of big data from the fleet, there might be specific cases which are extremely rare but very important, e.g. corner case detection, where safety functions suddenly fail. On the other hand there are use cases which are completely fine with reducing the original data accuracy by generalization, adding noise, synthetization, and other techniques.
What is rarely discussed in literature is the situation we are confronted with when bringing the two worlds - anonymization and legal conformity - together. The formal process - at least for the GDPR - requires an a priori estimation or even a declaration of the possible risks related to the planned collection of data, which is - as described before - not possible to pre-calculate, because we do not know what our database will look like. To handle this dilemma, estimations have to be made and later put into perspective with the outcome of the applied protection measures on the real data.
We have to face the fact that data protection and big data analysis have some contradicting - but for both sides valid - motivations: data protection focuses on the reduction of person-related data to the minimum necessary; data-driven development wants to benefit from a huge amount of data to learn more about the unknown. We have to consider intentions of developers to merge data from different data sources (use cases) to gather more information (leaving aside the technical difficulties of preventing this after all). Therefore the anonymization metric needs to be able to consider (gradual) background knowledge.
Two assessments are therefore needed: an a priori risk assessment to be able to estimate the degree of "personality" in the requested data as an input for a data protection impact assessment (Datenschutzfolgenabschätzung), and a post risk assessment to prove whether the considered measures give the expected results with real data, in order to approve the final access to the data.
The process starts with an entity requesting a data collection for a data analytics purpose P1. In doing so, the entity directly describes its utility policy by defining the mandatory data, data accuracy, and mandatory user roles. Accuracy may also be named quality. This also goes well with the concept of data minimization and the need-to-know principle requested by the GDPR. With this information and a pre-defined privacy policy, an a priori risk estimation P2 of the expected data can be performed. This risk as well as the technical organizational measures (TOMs) from the IT-Security department will be part of the entry in the register of processing data. If the a priori condition is met, the data collection job can start. While data is collected in the vehicle P3, first measures S1 defined in the privacy policy can be performed to reduce data and personal information. The providing of the observation data by the data supply units cannot be considered to be anonymous, since the observation data have to be transmitted to the server. The transmission at least needs an individual IP address of the respective data supply units. During the data transport P4 to the server, individuality can be removed by techniques using a relay device, third party escrow (trusted entity) or TOR-like systems to obfuscate/remove traceability to the data source S2. After the data has been received by the server, which may be a backend or a cloud, real anonymization measures S3 and metrics to calculate the personal information in the total data set can be performed by the server.
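The steps P1-P7 and S1-S3 above can be sketched end to end. All function names, the speed attribute and the rounding-based generalization are illustrative assumptions; shuffling stands in for the relay step S2 and sorting for the server-side aggregation.

```python
import random

def collect(unit, request):
    # P3 with S1 inside: unit-side generalization, here rounding the
    # requested attribute to 10-unit steps before it leaves the vehicle.
    return round(unit[request["attribute"]], -1)

def run_collection_job(request, units, a_priori_ok, a_posteriori_ok):
    """End-to-end sketch of steps P1-P7."""
    if not a_priori_ok(request):              # P2: a priori risk estimation
        return None
    observations = [collect(u, request) for u in units]
    random.shuffle(observations)              # P4/S2: relay removes ordering
    data_set = sorted(observations)           # P5/S3: server aggregates
    if not a_posteriori_ok(data_set):         # P6: a posteriori verification
        return None
    return data_set                           # P7: release to the requester
```

With two units reporting speeds 83 and 47, the released data set contains only the generalized values 80 and 50; if either risk check fails, nothing is released.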
The post risk assessment P6 now can compare the real data after applying the privacy protection measures S1, S2, S3 with the a priori estimated risk and, if necessary, adjust the utility policy (if possible) or the TOMs. Continuous monitoring of the privacy policy at a surveillance device as well as of the TOMs at the IT-Security department assures protection against future privacy violations.
To understand the following concept of a possible a priori risk estimation P2, it is necessary to understand how anonymization measures function. To explain it in a few words, they take a look at a data set and check for individual attribute constellations or small groups revealing a common information; if the group size is not big enough (a pre-defined minimum), they start manipulating attributes (e.g. generalization) until they have reached the acceptable threshold. For the algorithms to work, they need to know which data can be manipulated in which manner, and how far, so data accuracy is not completely lost. Therefore we need catalogues for our use cases to feed the algorithms. The only person able to set the acceptable level for the data accuracy, or to estimate the expected amount of data, is the developer. We can use this information to give an estimation about how likely it will be that an anonymization algorithm will be successful or not, and therefore also to differentiate whether the process is likely to be relevant for privacy legislation or not. Furthermore, we - as the designers of the data processing process - will not know beforehand which attributes in which combinations are requested by the developer. Therefore we also need a metric to estimate combinations of different attributes. The approach is described in the following. We first will need an adjustable catalogue of possible attributes necessary for the developer. We can try to classify them like:
- date and time
- location (GPS, proper motion, image/video, etc.)
- weather conditions (temperature, rain, snow, light, vision, etc.)
- vehicle conditions (velocity, acceleration, software/hardware versions, etc.)
- driver conditions (activation of triggers, activated services, video detection of driver face, etc.)
- etc.
Having a classification, we can go deeper and think about which variations of data representations/formats can appear. Tab. 1 shows observation data of different accuracies, given in different hierarchy levels.
Tab. 1: observation data of different accuracies.
For example, date and time can look like:
- DD/MM/YYYY HH:MM:SS
- DD/MM/YYYY + HH:MM:SS
- morning, noon, evening, night
- sunrise, sunset
- etc.
Then we need to specify which generalization per data format can be applied, also defined as a so-called "Domain Generalization Hierarchy" (DGH). For example, hierarchies for time can look like this:
- HH:MM:SS
- 00:00 - 00:10, 00:10 - 00:20, ...
- 00:00 - 00:30, 00:30 - 01:00, ...
- 00:00 - 01:00, 01:00 - 02:00, ...
- 00:00 - 02:00, 02:00 - 04:00, ...
- etc.
Or:
- HH:MM:SS
- Morning, Noon, Afternoon, Night
- Day/Night
Or another example, location:
- 11.4061, 48.79315
- ZIP Code 85045
- ZIP Code 85***
- ZIP Code 8****
- ZIP Code 7*/8*
- ZIP Code 6*/7*/8*/9*
- ZIP Code 0*/3*/4*/5*/6*/7*/8*/9*
- Germany
- Europe
Or:
- 11.4061, 48.79315
- Ingolstadt District Audi
- Ingolstadt
- Bavaria
- South Germany
- Germany
- Europe
Or:
- 11.4061, 48.79315
- 11.4106, 48.7813
- 11.4, 48.7
From this, one can understand that there are, first, many options on how to manipulate data to reduce information, and second, that based on the selection of the hierarchy of interest, attributes reveal different levels of accuracy. We therefore introduce a weight value to regulate the degree of (privacy) sensitivity. A possible database comprising the weight values is shown in Tab. 2.
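The domain generalization hierarchies above can be sketched as functions that walk an attribute up the hierarchy level by level. The level numbering, the interval format and the part-of-day mapping are illustrative assumptions.

```python
def generalize_time(hhmmss: str, level: int) -> str:
    """Walk a time value up an assumed DGH:
    exact time -> 30-minute interval -> part of day -> Day/Night."""
    hour = int(hhmmss[:2])
    if level == 0:
        return hhmmss
    if level == 1:
        half = 0 if int(hhmmss[3:5]) < 30 else 30
        end = f"{hour:02d}:30" if half == 0 else f"{(hour + 1) % 24:02d}:00"
        return f"{hour:02d}:{half:02d} - {end}"
    if level == 2:
        # Assumed mapping of six-hour blocks to parts of the day.
        return ["Night", "Morning", "Noon", "Afternoon"][hour // 6]
    return "Day" if 6 <= hour < 18 else "Night"

def generalize_zip(zip_code: str, level: int) -> str:
    """Mask trailing digits of a ZIP code, e.g. 85045 -> 85*** -> 8****."""
    keep = max(len(zip_code) - level, 0)
    return zip_code[:keep] + "*" * (len(zip_code) - keep)
```

An anonymization algorithm can then raise the level of a quasi-identifier step by step until the required group size, e.g. a k-anonymity of 5, is reached.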
Tab. 2: weight values for observation data of different accuracies.
We will also need to differentiate whether we are dealing with an attribute which is a direct identifier, a quasi-identifier, or a non-identifier. Direct identifiers can be: name, address, VIN, IP-address, etc. Quasi-identifiers do not directly identify a person, but can be used in combination with other quasi-identifiers or additional (external) knowledge (e.g. social media) to identify a subject, such as GPS (e.g. home location), time (working hours, hours out of house, etc.), velocity (fast or slow driver, using highways or only urban streets, etc.), and many others. To give an example, if we record a whole route with GPS coordinates, we might not identify the subject, but the recorded route possibly reveals the home and work address of the subject. We therefore already have two extra pieces of information, which can lead to identification of the subject (reading the name shield on the door bell). Non-identifiers are attributes with a low probability of revealing or gaining any useful personal information. Examples are debug information, the amount of trigger occurrences, etc. Knowing this, removing identifiers from a record does not necessarily mean that the record is anonymous. We suggest an estimation of the criticality of using specific attributes or combinations as follows: we take the individual attribute (identifier, quasi-identifier, non-identifier) and multiply it with the corresponding hierarchy weight. Before we sum up all the weighted attributes, we declare that a specific amount of non-identifiers will lead to a comparable factor like a quasi-identifier. Also a specific amount of quasi-identifiers will have similar values like a direct identifier. To avoid having many non-identifiers leading to a criticality similar to many quasi-identifiers or even identifiers, we will use an e-function to limit the total influence. The weighting characteristics are self-chosen and need to be evaluated over real data and time. For the moment, we use the following formula:
RG = aG − cG · e^(−(Σj wj) / bG)

RG is the group criticality, giving the degree of personal reference of a group. The groups may comprise non-identifiers, quasi-identifiers and direct identifiers. wj may be a weight value of an observation data element of a specific accuracy. aG, bG and cG are group-specific parameters. The parameters may be aG = 1, bG = 7, cG = 1 for non-identifiers, aG = 2.5, bG = 8, cG = 2.5 for quasi-identifiers and aG = 5, bG = 4.5, cG = 5 for direct identifiers.
The weighting function parameters are adjusted based on the following criteria:

- non-identifiers shall converge to 1, regardless of how many non-identifiers we have; they will not exceed a criticality higher than that of four quasi-identifiers (equal to a criticality around 1);
- quasi-identifiers shall converge to 2.5, regardless of how many quasi-identifiers we have; they will not exceed a criticality higher than that of 2.5 direct identifiers (equal to a criticality around 2.5); four quasi-identifiers have a value comparable to one direct identifier;
- direct identifiers shall converge to 5.

The total criticality is the sum of all attributes weighted with their respective function. The partial equations are only valid if there is at least one attribute in the respective category. The sum can reach a maximum of 8.5.
We assume that it will be practically impossible to identify an individual when the criticality is below 0.5. As soon as we exceed 0.5, there are possibilities that an individual can be identified. Based on the current parameter settings, this can happen, for example, when having two quasi-identifiers weighted with one, resulting in a criticality of around 0.55.
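The criticality estimation described above can be sketched in code. The saturating form RG = aG − cG · e^(−Σ wj / bG) is a reconstruction that matches the stated group parameters, the 8.5 maximum and the 0.55 example; all function and variable names are illustrative and not part of the described system:

```python
import math

# Group-specific parameters (a_G, b_G, c_G) taken from the description;
# the saturating form R_G = a_G - c_G * exp(-sum(w) / b_G) is a
# reconstruction consistent with the stated limits and the 0.55 example.
PARAMS = {
    "non_identifier": (1.0, 7.0, 1.0),     # converges to 1
    "quasi_identifier": (2.5, 8.0, 2.5),   # converges to 2.5
    "direct_identifier": (5.0, 4.5, 5.0),  # converges to 5
}

def group_criticality(weights, group):
    """Criticality R_G of one attribute group; 0 if the group is empty."""
    if not weights:  # the partial equation is only valid with >= 1 attribute
        return 0.0
    a_g, b_g, c_g = PARAMS[group]
    return a_g - c_g * math.exp(-sum(weights) / b_g)

def total_criticality(groups):
    """Sum of the group criticalities; saturates at 1 + 2.5 + 5 = 8.5."""
    return sum(group_criticality(w, g) for g, w in groups.items())

# Two quasi-identifiers weighted with one yield a criticality around 0.55,
# i.e. just above the 0.5 threshold discussed in the text.
print(round(total_criticality({"quasi_identifier": [1.0, 1.0]}), 2))  # 0.55
```

The exponential ensures that adding ever more attributes of one group never pushes that group's contribution beyond its limit aG.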
Overall, the example shows how a risk handling process can be provided for personal data originating in one source and transmitted to a data sink.
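That two-stage process, an a priori check before collection and an a posteriori check on the collected data, can be sketched as follows. The 0.5 threshold comes from the passage above; the function names and the callback structure are illustrative assumptions, not the claimed implementation:

```python
# Hedged sketch of the two-stage release verification: the request is gated
# before collection (a priori) and the collected data set is gated again
# afterwards (a posteriori). Names and structure are illustrative only.
THRESHOLD = 0.5  # above this, identification of an individual becomes possible

def verify_a_priori(requested_weights, estimate):
    """Release the collection request only if the estimated criticality of
    the data set to be collected stays at or below the threshold."""
    return estimate(requested_weights) <= THRESHOLD

def verify_a_posteriori(collected_weights, estimate):
    """Release the collected data set only if its criticality, recomputed on
    the real data, stays at or below the threshold."""
    return estimate(collected_weights) <= THRESHOLD

def run_collection_job(requested_weights, collect, estimate):
    """End-to-end job: a priori gate -> data collection -> a posteriori gate."""
    if not verify_a_priori(requested_weights, estimate):
        return None  # request rejected before any data is gathered
    data_set, collected_weights = collect()
    if not verify_a_posteriori(collected_weights, estimate):
        return None  # collected data turned out too revealing; withheld
    return data_set

# Toy estimator (sum of weights) standing in for the estimation methods:
released = run_collection_job(
    requested_weights=[0.1, 0.2],
    collect=lambda: ("data", [0.1, 0.2]),
    estimate=sum,
)
print(released)  # data
```

Rejecting the request before any data is gathered is what distinguishes the a priori gate: no personal data ever leaves the data supply units for a job that could not be released anyway.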
Claims
1. Method for verifying an approval of a data set (9) of a data collection job performed by a data collection system (1), wherein at least one observation data element (8) of the data set (9) is provided by data supply units (4) of the data collection system while the data supply units (4) are operated by users and collected in the data set (9) by a server (5) of the data collection system (1), wherein the method comprises steps of:
- Reception of a data collection request (6) by a release verification device (3), wherein the data collection request (6) comprises a utility policy (7), defining the at least one observation data element (8) of the data set (9) and a predefined accuracy (10) of the at least one observation data element (8) of the data set (9) to be collected by the data collection system (1);
- Calculation of a degree of a personal reference (17) of the data set (9) to be collected by the data collection system (1) according to a predefined a priori estimation method by the release verification device (3);
- Release of the data collection request (6) for the data collection system by the release verification device (3), in case the degree of the personal reference of the data set (9) to be collected fulfills a predefined a priori condition (23), wherein the a priori condition (23) is related to a reduced risk of assignability of one of the data supply units (4) to a particular user based on the data set (9) to be collected;
- Execution of the data collection job by the data collection system (1) according to the utility policy (7) of the data collection request (6) and transmission of the collected data set (9) to the release verification device (3);
- Calculation of a degree of a personal reference (19) of the data set (9) collected by the data collection system (1) according to a predefined a posteriori estimation method by the release verification device (3); and
- Release of the data set (9) collected by the data collection system (1) by the release verification device (3), in case the degree of the personal reference (19) of the data set (9) collected by the data collection system (1) fulfills a predefined a posteriori condition (18), wherein the a posteriori condition (18) is related to a reduced risk of assignability of one of the data supply units (4) to a particular user based on the data set (9) collected by the data collection system (1).
2. Method according to claim 1, wherein the degree of personal reference (17, 19) is based on k-anonymity, l-diversity, t-closeness or differential privacy.
3. Method according to claim 1, wherein the degree of personal reference (17) of the data set (9) to be collected is calculated by the release verification device (3) using a predefined method comprising the steps of:
- reading out a weight value (21) assigned to the at least one observation data element (8) of the data set (9) of the respective accuracy (10) of the at least one observation data element (8) from a database (20), wherein the weight value (21) describes a degree of the personal reference of the at least one observation data element (8);
- combination of the weight values (21) of the at least one observation data element (8) of the data set (9) to be collected according to a predefined combination method to get the degree of the personal reference (17) of the data set (9) to be collected.
4. Method according to one of the preceding claims, wherein the execution of the data collection job by the data collection system (1) comprises predefined anonymization steps (S1, S2, S3), defined in a privacy policy (15) saved in the data collection system (1).

5. Method according to claim 4, wherein the predefined anonymization steps (S1, S2, S3) comprise generalization of the at least one observation data element (8) by the data supply units (4) after measurement of the at least one observation data element (8) to the predefined accuracy according to the privacy policy (15).

6. Method according to claim 4 or 5, wherein the predefined anonymization steps (S1, S2, S3) comprise:
- Relaying of the at least one observation data element (8) from the data supply units (4) to the server (5) via a relay device (16) of the data collection system (1), wherein transmission data and/or identification data of the data supply units (4) provided by the data supply units (4) are removed by the relay device (16) before forwarding the at least one observation data element (8) to the server (5).

7. Method according to claim 4, 5 or 6, wherein the predefined anonymization steps (S1, S2, S3) comprise an anonymization and/or reduction of the accuracy of the at least one observation data element (8) by the server (5) according to the privacy policy (15).

8. Method according to one of the preceding claims, wherein the data collection request (6) comprises an auxiliary utility policy (11), defining at least one auxiliary observation data element (12) of the data set (9) and an auxiliary accuracy (14) of the at least one auxiliary observation data element (12) of the data set (9) to be collected by the data collection system (1), in case the degree of the personal reference (19) of the data set (9) collected by the data collection system (1) according to the utility policy (7) does not fulfill the predefined a posteriori condition (18), wherein the method comprises steps of:
- Creation of the data set (9) according to the auxiliary utility policy (11) collected by the data collection system (1);
- Calculation of a degree of a personal reference (19) of the data set (9) according to the auxiliary utility policy (11) collected by the data collection system (1) according to the predefined a posteriori estimation method by the release verification device (3); and
- Release of the data set (9) collected by the data collection system (1) according to the auxiliary utility policy (11) by the release verification device (3), in case the degree of the personal reference of the data set (9) collected by the data collection system (1) according to the auxiliary utility policy (11) fulfills the predefined a posteriori condition (18).
9. Method according to one of the preceding claims, comprising the steps of:
- Comparison of the degree of the personal reference (17) of the data set (9) to be collected by the data collection system (1) according to the predefined a priori estimation method with the degree of the personal reference (19) of the data set (9) collected by the data collection system (1) according to the predefined a posteriori estimation method by the release verification device (3);
- Adaptation of measures (22) defining organizational and technical protection of the data sets as a whole according to a predefined adaptation method.
10. Method according to one of the preceding claims, comprising the steps of:
- Comparison of the degree of the personal reference (17) of the data set (9) to be collected by the data collection system (1) with the degree of the personal reference (19) of the data set (9) collected by the data collection system (1) according to the utility policy (7) by the release verification device (3);
- Adaptation of the predefined a priori estimation method by the release verification device (3).
11. Data collection system (1) configured to execute a data collection job to collect at least one observation data element (8) of a data set (9) of a predefined accuracy (10) of a data set (9) to be collected according to a utility policy (7) of a data collection request (6), and to transmit the collected data set (9) to a release verification device (3), comprising:
- data supply units (4) configured to provide the at least one observation data element (8) of the data set (9) while the data supply units (4) are operated by users;
- a server (5) configured to collect the at least one observation data element (8) in the data set (9).

12. Data supply unit (4) of a data collection system (1) according to claim 11, configured to provide at least one observation data element (8) of a data set (9) of a predefined accuracy of a data set (9) to be collected according to a utility policy (7) of a data collection request (6) while the data supply unit (4) is operated by a user.

13. Server (5) of a data collection system (1) according to claim 11, configured to collect at least one observation data element (8) in a data set (9).

14. Release verification device (3), configured to:
- receive a data collection request (6), wherein the data collection request comprises a utility policy (7), defining the at least one observation data element (8) of the data set (9) and a predefined accuracy (10) of the at least one observation data element (8) of the data set (9) to be collected by the data collection system (1);
- calculate a degree of a personal reference (17) of the data set (9) to be collected by the data collection system (1) according to a predefined a priori estimation method;
- release the data collection request (6) for the data collection system (1), in case the degree of the personal reference (17) of the data set (9) to be collected fulfills a predefined a priori condition (23), wherein the a priori condition (23) is related to a reduced risk of assignability of one of data supply units (4) of the data collection system (1) to a particular user based on the data set (9) to be collected;
- receive the collected data set (9) from the data collection system (1);
- calculate a degree of a personal reference (19) of the data set (9) collected by the data collection system (1) according to a predefined a posteriori estimation method; and
- release the data set (9) collected by the data collection system (1), in case the degree of the personal reference (19) of the data set (9) collected by the data collection system (1) fulfills a predefined a posteriori condition (18), wherein the a posteriori condition (18) is related to a reduced risk of assignability of one of the data supply units (4) to a particular user based on the data set (9) collected by the data collection system (1).
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2022/053145 WO2023151789A1 (en) | 2022-02-09 | 2022-02-09 | Method for verifying an approval of a data set of a data collection job performed by a data collection system, data collection system, data supply unit, server and release verification device |
| EP22708472.0A EP4445279A1 (en) | 2022-02-09 | 2022-02-09 | Method for verifying an approval of a data set of a data collection job performed by a data collection system, data collection system, data supply unit, server and release verification device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2022/053145 WO2023151789A1 (en) | 2022-02-09 | 2022-02-09 | Method for verifying an approval of a data set of a data collection job performed by a data collection system, data collection system, data supply unit, server and release verification device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023151789A1 true WO2023151789A1 (en) | 2023-08-17 |
Family
ID=80683247
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2022/053145 Ceased WO2023151789A1 (en) | 2022-02-09 | 2022-02-09 | Method for verifying an approval of a data set of a data collection job performed by a data collection system, data collection system, data supply unit, server and release verification device |
Country Status (2)
| Country | Link |
|---|---|
| EP (1) | EP4445279A1 (en) |
| WO (1) | WO2023151789A1 (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160164922A1 (en) | 2014-05-06 | 2016-06-09 | International Business Machines Corporation | Dynamic adjustment of authentication policy |
| US20200042723A1 (en) | 2018-08-03 | 2020-02-06 | Verizon Patent And Licensing Inc. | Identity fraud risk engine platform |
| EP3716125A1 (en) * | 2019-03-27 | 2020-09-30 | Privacy Analytics Inc. | Systems and methods of data transformation for data pooling |
| US20200334762A1 (en) * | 2014-04-15 | 2020-10-22 | Speedgauge,Inc | Vehicle operation analytics, feedback, and enhancement |
| US20210089680A1 (en) | 2019-09-20 | 2021-03-25 | International Business Machines Corporation | Policy driven data movement |
| US20220012364A1 (en) * | 2013-11-01 | 2022-01-13 | Anonos Inc. | Systems and methods for enforcing privacy-respectful, trusted communications |
Non-Patent Citations (3)
| Title |
|---|
| ASUQUO PHILIP ET AL: "Security and Privacy in Location-Based Services for Vehicular and Mobile Communications: An Overview, Challenges, and Countermeasures", IEEE INTERNET OF THINGS JOURNAL, IEEE, USA, vol. 5, no. 6, 1 December 2018 (2018-12-01), pages 4778 - 4802, XP011705655, DOI: 10.1109/JIOT.2018.2820039 * |
| DOMINGO-FERRER, JOSEP; SORIA-COMAS, JORDI: "From t-closeness to differential privacy and vice versa in data anonymization", KNOWLEDGE-BASED SYSTEMS, vol. 74, 2015, pages 151 - 158 |
| MARCO FIORE ET AL: "Privacy in trajectory micro-data publishing : a survey", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 13 May 2020 (2020-05-13), XP081663335 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4445279A1 (en) | 2024-10-16 |
Similar Documents
| Publication | Title |
|---|---|
| US12299169B2 | Dynamic management of data with context-based processing |
| US8965848B2 | Entity resolution based on relationships to a common entity |
| Azam et al. | Data privacy threat modelling for autonomous systems: A survey from the GDPR's perspective |
| US8375427B2 | Holistic risk-based identity establishment for eligibility determinations in context of an application |
| CN111431848A | Method for collecting and managing event data of a vehicle |
| US20150334119A1 | Intelligent role based access control based on trustee approvals |
| Panda et al. | Privacy impact assessment of cyber attacks on connected and autonomous vehicles |
| US11423416B2 | Impact based fraud detection |
| US20150324871A1 | Contextualized fair ranking of citizen sensor reports |
| CN109166315A | A kind of traffic accident information real-time sharing method based on block chain technology |
| JP2022083300A | Personal information management device and personal information management method |
| Xiong et al. | Threat modeling of connected vehicles: A privacy analysis and extension of vehiclelang |
| EP4445279A1 | Method for verifying an approval of a data set of a data collection job performed by a data collection system, data collection system, data supply unit, server and release verification device |
| Fujii | Verifiable record of AI output for privacy protection: public space watched by AI-connected cameras as a target example |
| CN118965386A | Rights management method, device, electronic device and storage medium |
| CN115292272B | Enterprise-level authority management method, system, electronic equipment and storage medium |
| JP2024048100A | Data management system and base station equipment |
| Alshammari et al. | Towards an effective PIA-based risk analysis: An approach for analysing potential privacy risks |
| CN112487436A | Monitoring method and system for mobile terminal |
| Alshammari et al. | Towards an effective privacy impact and risk assessment methodology: risk analysis |
| Fialová | Autonomous vehicles and European data protection law |
| Lee et al. | PCA in ERP environment using the misuse detection system design and implementation of RBAC permissions |
| Kurumbudel | General Data Protection Regulation Compliance in Connected Vehicles for Over The Air Updates, App Stores, and Data Logging and Analytics |
| Brown | Dynamic de-identification policies for pandemic data sharing |
| Kayode et al. | Frontier Platform for Crime Reporting System Pro Law Enforcement Agent |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22708472 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2022708472 Country of ref document: EP |
|
| ENP | Entry into the national phase |
Ref document number: 2022708472 Country of ref document: EP Effective date: 20240710 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |