DE102008027605A1

DE102008027605A1 - System and method for computer-based analysis of large amounts of data

Info

Publication number: DE102008027605A1
Application number: DE102008027605A
Authority: DE
Inventors: Ansgar Dr. Dorneich
Original assignee: INTELLIGEMENT AG
Current assignee: Amyam De GmbH
Priority date: 2008-06-10
Filing date: 2008-06-10
Publication date: 2010-01-14
Anticipated expiration: 2028-06-11
Also published as: WO2009149926A3; WO2009149926A2; DE102008027605B4

Abstract

For a computer system for data analysis, the training time is to be significantly reduced by technical measures; in addition, the memory required is to be reduced appreciably by technical measures. To this end, an electronic data processing system for analyzing data is proposed, with at least one analysis server and at least one on-site client computer. The analysis server is set up and programmed to implement a self-adapting neuron network that is to be trained on a database containing a large number of records with many features. The on-site client computer is set up to subject data supplied to it to data preprocessing and/or data compression before the data is sent from the on-site client computer over an electronic network to the analysis server. The analysis server is set up and programmed to train the self-adapting neuron network with the received, preprocessed/compressed data by repeatedly presenting the data to the self-adapting neuron network and then performing an analysis in order to create a self-adapting neuron-network model. The analysis server is further set up and programmed to cause the self-adapting neuron-network model to be sent from the analysis server to the on-site client computer. Finally, the on-site client computer is ...

Description

Background

Currently available, inexpensive computer programs for data analysis (for example DataCockpit® 1.04) are noticeably slower in analysis than competing data mining workbenches (SPSS and others), can process only considerably smaller amounts of data, and have other drawbacks (they are programmed as a monolithic block, their architecture and data handling are unsuited to a client-server architecture, etc.).

For segmentation or prediction, data is mapped onto a one-, two- or three-dimensional self-adapting neuron network (self-organizing map, 'SOM') [T. Kohonen, Self-Organization and Associative Memory, vol. 8 of Springer Series in Information Science, 3rd edition, Springer-Verlag, Berlin, 1989].

SOM-based data analysis distinguishes between so-called 'Kohonen clustering' and so-called SOM map analysis. Kohonen clustering works with only very few neurons, typically between about 4 and about 20. Each of these neurons represents a 'cluster', i.e. a homogeneous group of records. This technique is used primarily for data segmentation and is implemented in many data mining software packages, for example SPSS Clementine or IBM DB2 Warehouse (see for example Ch. Ballard et al., Dynamic Warehousing: Data Mining Made Easy, IBM Redbook, 2007).

SOM map analysis, by contrast, uses relatively large neuron networks of, for example, 30 × 40 neurons for data analysis. Here, homogeneous data segments are represented by local groups of neurons with similar feature values. SOM maps are used for data exploration, segmentation, prediction, simulation and optimization (see for example R. Otte, V. Otte, V. Kaiser, Data Mining für die industrielle Praxis, Hauser Verlag, München, 2004).

EP 97 11 56 54.2 and EP 97 12 0787.3 may be cited as examples of further technological background.

To analyze an extensive collection of data gathered on a computer (for example, production data from a manufacturing plant with about 10⁴ to 10¹⁰ records and about 3 to 1000 features per record) and, where appropriate, to feed the results of the analysis back into the production process, the existing records are presented again and again to a learning, self-adapting neuron network.

This may be production data in the mechanical engineering, chemical, automotive or supplier industries: for example, 10 million units produced, 10 nominal items of component and production line information, 10 binary items of component and equipment information, and 10 numeric production data items (measured tolerance data, sensor data, recorded production times, machine data, ...). The aim of the SOM analysis here is quality assurance, fault source analysis, early warning and production process optimization. Another example would be customer data in retail, financial or insurance companies: 10 million customers, 10 nominal demographic features (marital status, occupational group, region, type of dwelling, ...), 10 binary features on interests and on services/products used (sex; owns a credit card; uses online banking, ...), and 10 numeric features (annual income, age, annual turnover, creditworthiness, ...). The aim of the SOM analysis here is customer segmentation, the prediction of customer value, creditworthiness, claims risk, ..., and the optimization of marketing campaigns.

Each neuron of the self-adapting neuron network has as many signal inputs as each individual record has features. Once the neuron network has 'learned' the data, the trained network can be used for the following tasks, among others:

• Visual interactive data exploration: interactive discovery of interesting subgroups, correlations between features and general relationships by means of various visualizations of the data, which are generated from self-organizing feature maps.
• Segmentation: dividing the entire data into homogeneous groups.
• Prediction: predicting previously unknown feature values in individual records.
• Simulation: how would certain feature values of a record probably change if certain other feature values were deliberately changed?
• Optimization: if certain optimal values are to be achieved for a subset of the features, how should the remaining feature values be chosen?
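The description does not spell out the training rule itself. As an illustration of what "repeatedly presenting the records" to a network whose neurons have one input per feature can look like, the following is a minimal sketch of the classic Kohonen update. The rectangular grid, the Gaussian neighbourhood, the linearly decaying learning rate and radius, and all function and parameter names are assumptions made for this example, not details taken from the patent.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(records, rows, cols, epochs=10, lr0=0.5, seed=0):
    """Train a small rectangular SOM; weights[r][c] has one entry per feature."""
    rng = random.Random(seed)
    n_features = len(records[0])
    radius0 = max(rows, cols) / 2.0
    weights = [[[rng.random() for _ in range(n_features)]
                for _ in range(cols)] for _ in range(rows)]
    total_steps = epochs * len(records)
    step = 0
    for _ in range(epochs):              # every epoch presents all records again
        for x in records:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)      # learning rate decays linearly
            radius = max(radius0 * (1.0 - frac), 0.5)
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * radius * radius))
                    w = weights[r][c]
                    for i in range(n_features):
                        # pull the neuron towards the record, scaled by grid distance
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

The inner loops make the cost per presented record proportional to the number of neurons times the number of features, which is why large maps on large tables are expensive to train.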

Existing methods and implementations of SOM-map-based data analysis currently require training times for the neuron networks that are too long for commercial use. They exceed the training times of other data mining techniques on the same data by a factor of about one hundred, and they prevent such existing software packages from being applied to many existing data collections and questions with the computing power currently available. For example, to train a SOM network of 30 × 40 neurons with the DataCockpit software on a large database of 60,000,000 records with 100 features, a server (one to two 3 GHz Intel® CPUs, 64 gigabytes of RAM) would have to compute without interruption for about 2 to 3 months; in practice this would be completely unacceptable.

Technical problem

There is therefore a technical requirement to reduce this training time significantly by technical measures. In addition, the memory required should be reduced appreciably by technical measures; the example above should require a main memory of only a few gigabytes of RAM to run.
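To give a feel for the orders of magnitude, the following back-of-the-envelope sketch estimates the size of the example table (60,000,000 records with 100 features). The 8 bytes per stored value are an assumption for illustration only; the description states neither the storage format nor the value width. The 5% and 12% factors are the compression range that the description quotes further on for the compressed data form, applied here to the assumed in-memory size.

```python
def estimate_memory_gb(records, features, bytes_per_value=8):
    """Raw size in decimal gigabytes, assuming a fixed width per stored value."""
    return records * features * bytes_per_value / 1e9


# Example table from the description; 8 bytes per value is an assumption.
raw_gb = estimate_memory_gb(60_000_000, 100)
compressed_gb = (0.05 * raw_gb, 0.12 * raw_gb)
print(f"raw: {raw_gb:.1f} GB, "
      f"compressed: {compressed_gb[0]:.1f} to {compressed_gb[1]:.1f} GB")
```

Under these assumptions the uncompressed table occupies about 48 GB, while the compressed form would need roughly 2.4 to 5.8 GB, which is in line with the stated goal of "a few gigabytes".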

Summary

To solve the problem, an electronic data processing system for analyzing data is proposed, having at least one analysis server and at least one on-site client computer, wherein the analysis server is set up and programmed to implement a self-adapting neuron network that is to be trained on a large database containing a large number of records with many features; wherein the on-site client computer is set up and programmed to subject data supplied to it to data preprocessing and/or data compression before the data is sent from the on-site client computer over an electronic network to the analysis server; wherein the analysis server is set up and programmed to train the self-adapting neuron network with the received, preprocessed/compressed data by repeatedly presenting the data to the self-adapting neuron network and then performing an analysis in order to create a self-adapting neuron-network model; and wherein the analysis server is set up and programmed to cause the self-adapting neuron-network model to be sent from the analysis server to the on-site client computer, and the on-site client computer is set up and programmed to subject the data of the self-adapting neuron-network model to decompression.

This arrangement has the technical effect of increasing the efficiency and the security of the data analysis. A further technical effect is that the demands on the required computer resources are lower than with the conventional approach. Finally, the data transmission speed and the subsequent data processing are positively influenced.

The type of data compression can be adapted to the structure of the data (Boolean, numeric, textual, etc.). This allows source data that is structured differently or captured in different ways (e.g. flat files, database tables, Excel spreadsheets) to be transformed into a compressed form that requires only about 5% to about 12% of the storage space of the original data. Since only this compressed form of the data is sent from the on-site client computer to the analysis server, a further technical advantage is faster data transfer with lower demands on the data channel. The data-type-dependent compression of the original data simultaneously anonymizes the data. The compression can also be carried out in such a way that the precision of the data during compression/decompression does not unduly distort the result of the analysis.
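The description does not disclose the concrete encodings. As one plausible reading of "compression adapted to the data type", the sketch below dictionary-encodes textual values into one byte per record and quantizes numeric values to 256 levels. Both schemes, the one-byte code width, and all function names are assumptions chosen for illustration.

```python
def encode_nominal(values):
    """Dictionary-encode text values as one byte per record (at most 256 labels)."""
    dictionary = sorted(set(values))
    if len(dictionary) > 256:
        raise ValueError("more than 256 distinct values; one byte is not enough")
    index = {v: i for i, v in enumerate(dictionary)}
    return bytes(index[v] for v in values), dictionary


def decode_nominal(codes, dictionary):
    """Invert encode_nominal; this step is lossless."""
    return [dictionary[c] for c in codes]


def encode_numeric(values, levels=256):
    """Quantize numbers to `levels` (<= 256) equally spaced steps between min and max."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0          # avoid division by zero for constant columns
    codes = bytes(round((v - lo) / span * (levels - 1)) for v in values)
    return codes, (lo, hi)


def decode_numeric(codes, bounds, levels=256):
    """Invert encode_numeric up to half a quantization step of error."""
    lo, hi = bounds
    return [lo + c / (levels - 1) * (hi - lo) for c in codes]
```

The numeric step is lossy: the reconstruction error is bounded by half a quantization step, which illustrates the trade-off between compression and the precision that the analysis needs. The integer codes alone also carry no readable labels, which hints at why compression and anonymization go together here.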

The compressed data form is very well suited to neuron-network analyses, but also to fast interactive data exploration, e.g. by means of multivariate statistics, in which the results are to be available in real time or near real time (with a short waiting time of less than a few tens of seconds).

With reference to Fig. 1, an electronic data processing system serves to analyze data. The electronic data processing system has an analysis server 10 and one or more on-site client computers 12. The analysis server is, for example, a PC with several 3 GHz Intel® CPUs and 64 gigabytes of RAM as main memory. A self-adapting neuron network is to be implemented in it as a data object, which is to be trained on a large database containing a large number of records with many features. The on-site client computer 12 is set up and programmed to subject data supplied to it to data preprocessing and/or data compression before the data is sent over an electronic network 14, for example the Internet, to the analysis server 10. The analysis server 10 is also set up and programmed to train the self-adapting neuron network with the received, preprocessed/compressed data by repeatedly presenting the data to the self-adapting neuron network, and then to perform an analysis in order to create a self-adapting neuron-network model. The analysis server then causes the self-adapting neuron-network model to be sent from the analysis server 10 to the on-site client computer 12, likewise over the network 14. Finally, the on-site client computer 12 is set up and programmed to subject the data of the self-adapting neuron-network model to decompression.

The data compression can be used for several kinds of powerful interactive data analysis and data exploration techniques that still allow interactive work in real time even on large data sources of more than one gigabyte in size. In interactive multivariate statistics, value distribution diagrams (histograms) for several or all features of a data collection are displayed side by side on the screen. If some of the histogram bars are selected in one or more of the diagrams, the other diagrams immediately show the remaining frequencies. This allows a very flexible 'drill down' into the data without the laborious task of building and maintaining a multidimensional OLAP cube. The problem is that response times slow down considerably on large data sets. The IBM DB2 Data Warehouse Edition software takes this into account by performing the multivariate analysis by default on a data space of only 1000 records. With that, however, one cannot expect to obtain correct and reliable results on tables of, for example, 10⁶ to 10⁸ records. The data compression presented here offers an elegant way out. Once the data collection has been compressed and loaded into main memory in compressed form, the multivariate analysis can still be carried out in real time, without sampling, even on data collections of several gigabytes in original size.
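The linked-histogram behaviour described above can be sketched as follows. The example assumes that each feature is already held in memory as a column of small integer codes (for instance one byte per record); that representation and the names `linked_histograms`, `columns` and `selection` are assumptions for illustration, not part of the patent text.

```python
from collections import Counter


def linked_histograms(columns, selection):
    """Recompute every feature's histogram for the records matching a selection.

    columns:   dict mapping a feature name to a sequence of integer codes
    selection: dict mapping a feature name to the set of selected codes (bars)
    """
    n = len(next(iter(columns.values())))
    # A record stays in view only if it matches the selected bars in every
    # diagram in which a selection was made.
    keep = [all(columns[name][i] in chosen for name, chosen in selection.items())
            for i in range(n)]
    return {name: Counter(col[i] for i in range(n) if keep[i])
            for name, col in columns.items()}


# Selecting bar 0 of feature "N0" updates the remaining frequencies of "B0".
columns = {"N0": bytes([0, 1, 0, 2]), "B0": bytes([1, 1, 0, 0])}
result = linked_histograms(columns, {"N0": {0}})
```

Each update is a single pass over compact in-memory columns, so no precomputed OLAP cube is needed; whether this reaches real-time response on a given table depends on its size and the hardware.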

The procedure described allows large amounts of data to be handled, with a significant increase in the volume of data that can be analyzed. For textual (nominal) data, a drastic reduction in storage requirements is possible, and the analysis speed for predominantly non-numeric data increases significantly. The analysis speed also increases because the training of the SOM models is accelerated. The data is anonymized by dividing it into a confidential part and a non-confidential part. Only the non-confidential part is transmitted to the analysis server and analyzed by it. The confidential part remains on the on-site client. The client can be set up and programmed so that, when the anonymized analysis result arrives from the analysis server, it merges this anonymized result with the confidential part of the data to form a de-anonymized plain-text analysis result.

When confidential data is anonymized, a flat file in the broadest sense (e.g. a comma-, semicolon- or tab-separated text file with variable column width, a text file with fixed column width, a table from a spreadsheet program such as Microsoft® Excel® or OpenOffice®, a table in a relational, object-oriented or XML database, etc.) is compressed, and at the same time all potentially confidential information is removed from the data. The confidential information is stored in a separate data description. The information filtered out is, for example, the following: first, feature names, which are replaced by anonymized names such as C0, C1, ..., D0, D1, ..., B0, B1, ..., N0, N1, ..., where C stands for 'continuous numeric', D for 'discrete numeric', B for binary (two-valued) and N for nominal (textual); second, textual feature values, which are replaced by anonymized values such as V, V1, ... or VALUE0, VALUE1, ...; and third, numeric values, whose actual value ranges are transformed to a normalized distribution with mean m = 0 and spread s = 1. Further possible anonymizations for numeric features concern not only the first two moments of the distribution, m and s, but also higher moments such as skewness or kurtosis.
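The three replacements listed above can be sketched for the two simplest feature kinds, continuous numeric ('C') and nominal ('N'). This is a minimal illustration under stated assumptions: the table is a dict of columns, the spread s is taken to be the population standard deviation, and the function name `anonymize` and the layout of the returned data description are hypothetical.

```python
import statistics


def anonymize(table, kinds):
    """Split a table into anonymized data and a confidential data description.

    table: dict mapping a feature name to a list of values
    kinds: dict mapping a feature name to 'C' (continuous) or 'N' (nominal)
    """
    counters, anon, description = {}, {}, {}
    for name, values in table.items():
        kind = kinds[name]
        idx = counters.get(kind, 0)
        counters[kind] = idx + 1
        alias = f"{kind}{idx}"                 # e.g. C0, C1, ..., N0, N1, ...
        if kind == "C":
            m = statistics.fmean(values)
            s = statistics.pstdev(values) or 1.0   # constant column: keep s = 1
            anon[alias] = [(v - m) / s for v in values]
            description[alias] = {"name": name, "mean": m, "spread": s}
        elif kind == "N":
            labels = sorted(set(values))
            code = {v: f"VALUE{i}" for i, v in enumerate(labels)}
            anon[alias] = [code[v] for v in values]
            description[alias] = {"name": name,
                                  "values": {a: v for v, a in code.items()}}
        else:
            raise ValueError(f"kind {kind!r} is not covered by this sketch")
    return anon, description


table = {"income": [10.0, 20.0, 30.0], "region": ["north", "south", "north"]}
anon, description = anonymize(table, {"income": "C", "region": "N"})
```

Only `anon` would leave the client; `description` holds the original names, labels, means and spreads and stays local.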

Only the anonymized, compressed data is transmitted to the analysis server, which creates an anonymized SOM model from it and sends it back. An anonymized SOM model is de-anonymized again by recombining the confidential information from the separate data description with the SOM model. Although the analysis server and the on-site client computer have been assumed above to be separate, it is also possible to combine the computing capacity and software program components held in the two units in a single computer unit.

The anonymization of the data during compression, as described above, makes a client-server architecture for data analysis usable for confidential data as well. The analysis server hosts the software program component for training the self-adapting neuron network with the received compressed data and the software program component for performing an analysis. Because a software program component for compressing the data is provided in the client, a manufacturing company that wants the operator of the analysis server to carry out a quality analysis or improvement of its production processes can, for example, first preprocess/compress (and thereby anonymize) the data on its own on-site client computer. Only then does it send the data to the analysis server, where the analysis is performed on the preprocessed/compressed, anonymized data. The analysis server sends back the anonymized neuron network model, and the on-site client computer replaces the anonymized values in it with the original values.
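The round trip described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that anonymization replaces every feature value by an opaque integer code and that the lookup tables (the "data description") never leave the client; the function names `anonymize` and `deanonymize` are hypothetical and do not come from the source.

```python
def anonymize(records, feature_names):
    """Replace each feature value by an integer code; keep the lookup tables locally."""
    value_to_code = [dict() for _ in feature_names]
    coded = []
    for rec in records:
        row = []
        for j, value in enumerate(rec):
            table = value_to_code[j]
            if value not in table:
                table[value] = len(table)  # next free code for this feature
            row.append(table[value])
        coded.append(row)
    # The data description stays on the on-site client computer.
    description = {
        "feature_names": list(feature_names),
        "code_to_value": [{c: v for v, c in t.items()} for t in value_to_code],
    }
    return coded, description


def deanonymize(rows, description):
    """Recombine coded rows (e.g. taken from a returned model) with the description."""
    return [
        [description["code_to_value"][j][code] for j, code in enumerate(row)]
        for row in rows
    ]


records = [("plant A", "ok"), ("plant B", "defect"), ("plant A", "defect")]
coded, description = anonymize(records, ["site", "result"])
restored = deanonymize(coded, description)
```

Only `coded` would be sent over the network in this sketch; without `description` the integer codes carry no readable meaning.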

In addition, the anonymized data can be encrypted while it is transmitted over a network. The data anonymization described above is fully compatible with conventional secure transmission protocols and encryption methods such as PGP, https, or scp. Encrypting data that has already been anonymized is not strictly required, however. The technical advantage of anonymization is that, besides reducing the data volume (relative to the original data collected), it preserves the confidentiality of the data not only during transfer over the network but also throughout the entire analysis on the analysis server. Resource-consuming encryption and decryption for the transfer between the analysis server and the on-site client can therefore in principle be omitted, because the anonymized data is unintelligible without its correlation to the plain-text portion. Insight into the confidential data is thus denied not only to a potential eavesdropper during transmission but also to the operator of the analysis server.

The analysis server can be set up and programmed to train the self-adapting neuron network with the received anonymized data repeatedly until a converged network state results that adequately represents the data. Preferably, the data is presented to the self-adapting neuron network about 100 to about 200 times.

The on-site client computer can be set up and programmed to subject the data supplied to it, in volumes from about 10 gigabytes up to several terabytes, to data preprocessing and data compression. These data sizes refer to typical large database tables and the computer technology available in 2008. If general computing performance and database sizes continue to grow exponentially ('Moore's Law'), the data sizes mentioned will grow proportionally.

The analysis server can also be set up and programmed to provide, in addition to or instead of SOM modeling, further data mining or data analysis methods, for example association rule methods, decision tree methods, Bayesian methods, regression methods, or other neural analysis methods besides the SOM method. These methods, and many others, can operate directly on the compressed data format described above and thereby run significantly faster.

The on-site client computer can be set up and programmed to read the data supplied to it once during data preprocessing and to transform the original features it contains into purely numerical, normalized features.

The on-site client computer can be set up and programmed to store the normalized numerical feature values of the data supplied to it in compressed form during data compression, so that on average only between two bits and about one byte of storage is needed per feature value.

On the one hand, this allows the compressed data to be loaded completely into the working memory of the analysis server in a single load operation for many successive analyses, which may use different analysis methods. There it is processed by the software program component for training the self-adapting neuron network (and/or another analysis method) and by the software program component for performing an analysis. Repeated, time-consuming loading of the data for each analysis step is thus unnecessary. On the other hand, the adapted, high compression rate also allows the normalized, compressed data to be kept persistently on mass storage in addition to the original data. The otherwise usual 100 to 200 passes of record-by-record reading, parsing, and conversion into a format suitable for analysis can thus be dispensed with, even if the compressed data does not fit completely into the working memory of the computer used.

The combination of reading a record from hard disk as a character string and then parsing it, including mapping character strings to numerical values, can take about 10,000 times as long as accessing a compressed, already parsed and normalized record in working memory. A factor of about 1000 results from the higher access speed, and a factor of about 10 from the size of the data being reduced/compressed to about 5% to about 12%.

The on-site client computer can further be set up and programmed to select the data compression method to be used depending on the compression error that is acceptable for the respective analysis task. The compression method to be used and the compression rate to be achieved are chosen depending on the different feature types (Boolean, numerical, nominal (textual)) and on the accuracy requirements of the chosen analysis technique.

When the SOM analysis technique is used, the on-site client computer can be set up and programmed such that the mean prediction error for numerical features usually lies between 0.01 and 0.1 for sensibly trained SOM networks. This prediction error is the difference between the actual normalized feature value of a record and the normalized value predicted for that feature by the neuron that best matches the record overall. The advantage is that the network then does not try to reproduce exactly every feature value that happens to occur in a particular record, so that random fluctuations and coincidences in the training records are not learned as general regularities in the data (so-called 'overtraining' of the network). Rather, when an error is permitted, the self-adapting neuron network also delivers useful results when it is applied to new records that were not contained in the training data.

If a 'generalization error' of the self-adapting neuron network of 0.01 to 0.1 per feature is normal, then a mean additional error of 0.00001 per feature caused by compressing the training data is negligible and tolerable. In that case, feature values to be found by the self-adapting neuron network would be unaffected by the discretization error to 3 to 4 digits of accuracy.

Suppose, for example, that about 10,000 records are mapped onto one neuron (which is the case on average for, e.g., a network of 30·40 neurons and about 12 million records). By the law of large numbers, the 10,000 independent, random discretization errors then yield an overall influence of the discretization errors of 0.00001 exactly when each individual discretization error is on average 0.001 = 0.00001·√10000. A mean discretization error of 0.001 for a single record in a normalized numerical feature is therefore acceptable.
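The arithmetic of this error budget can be retraced in a few lines. This is only a restatement of the numbers given in the text; the variable names are illustrative.

```python
import math

neurons = 30 * 40                 # network size from the example
records = 12_000_000              # number of records from the example
records_per_neuron = records / neurons

# Averaging N independent random errors shrinks their combined influence
# by a factor of sqrt(N), so the per-record budget is larger by sqrt(N).
tolerated_total_error = 0.00001
per_record_error = tolerated_total_error * math.sqrt(records_per_neuron)
```

With these inputs `records_per_neuron` is 10,000 and `per_record_error` comes out at about 0.001, matching the figure stated above.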

The on-site client computer can be set up and programmed to discretize the normalized numerical feature values into a number of discrete intervals such that the mean discretization error |Value – DiscreteValue| is approximately 0.001. In other words, a mapping 'Value → IntervalIndex' is applied, together with the inverse mapping for back-calculation, 'IntervalIndex → DiscreteValue := midpoint of the interval'.
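A minimal sketch of the two mappings, assuming the simplest case of equidistant intervals on [0, 1] and evenly spread values (both are assumptions made for illustration; the function names are hypothetical). For an interval of width w with midpoint reconstruction, the mean absolute error of evenly spread values is w/4, so 250 intervals of width 0.004 give a mean error of about 0.001.

```python
def to_index(value, lo=0.0, hi=1.0, n_intervals=250):
    """Value -> IntervalIndex for equidistant intervals on [lo, hi]."""
    width = (hi - lo) / n_intervals
    idx = int((value - lo) / width)
    return min(max(idx, 0), n_intervals - 1)   # clamp values outside [lo, hi]


def to_discrete_value(index, lo=0.0, hi=1.0, n_intervals=250):
    """IntervalIndex -> DiscreteValue := midpoint of the interval."""
    width = (hi - lo) / n_intervals
    return lo + (index + 0.5) * width


# Evenly spread test values in [0, 1)
values = [i / 100_000 for i in range(100_000)]
mean_error = sum(abs(v - to_discrete_value(to_index(v))) for v in values) / len(values)
```

The non-equidistant interval layout described later in the text aims at the same error level with a fixed 8-bit budget even when the values are not evenly spread.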

The on-site client computer can be set up and programmed to store the discretized numerical value in 8 bits (= 1 byte) of storage, which provides 254 interval indices plus 2 indices for 'value not present' and 'invalid value'.

When used with other data mining or data analysis methods, the on-site client can also be set up and programmed to store the discretized numerical value in more or fewer than 8 bits. If an analysis method requires higher accuracy than the SOM method, 10 bits could, for example, hold 1022 different interval indices (plus 2 indices for 'value not present' and 'invalid value'), which would reduce the discretization error by a factor of 4 compared with storage in 8 bits.

The on-site client computer can also be set up and programmed such that the interval division depends on the value distribution and is non-equidistant. Preferably, the widths of the discrete subintervals are set particularly small in regions of high value density. The mean discretization error can then be kept below about 0.001 for most practically relevant value distributions (e.g., normal distribution, exponential distribution, Weibull distribution), even with storage in only 8 bits.

The on-site client computer can be set up and programmed such that, for numerical value distributions that follow the Gaussian or normal distribution with a mean m and a standard deviation s, the following intervals are defined:

• about 64 intervals of width s/64 in each of the ranges [m – s, m[ and [m, m + s[;
• about 32 intervals of width s/32 in each of the ranges [m – 2s, m – s[ and [m + s, m + 2s[;
• about 16 intervals of width s/16 in each of the ranges [m – 3s, m – 2s[ and [m + 2s, m + 3s[;
• about 8 intervals of width s/8 in each of the ranges [m – 4s, m – 3s[ and [m + 3s, m + 4s[;
• about 4 intervals of width s/4 in each of the ranges [m – 5s, m – 4s[ and [m + 4s, m + 5s[;
• about 2 intervals of width s/2 in each of the ranges [m – 6s, m – 5s[ and [m + 5s, m + 6s[; and
• about 1 interval of infinite width for each of ]–∞, m – 6s[ and [m + 6s, ∞[.

With this distribution, the value density is high in the range [m – s, m + s], drops sharply at distances of 1s to 2s from the mean, and quickly approaches 0 at even greater distances from the mean.

The interval division described above is optimized for storage in 8 bits. If more (fewer) than 8 bits per value are used, the interval counts given above are to be multiplied (divided) by a factor of 2 for each additional (omitted) bit.

The on-site client computer can be set up and programmed such that the discrete intervals, as a function of m and s, are distributed symmetrically about the mean m, with the interval [m – s/64, m[ at interval position 127 and the interval [m, m + s/64[ at position 128. The numerical value assigned to each interval is the interval midpoint, and interval positions 0 and 255 are reserved for invalid and missing values, respectively. These numbers are optimized for storage in 8 bits. If more (fewer) than 8 bits per value are used, the numbers given above are to be multiplied (divided) by a factor of 2 for each additional (omitted) bit.
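The 8-bit layout can be sketched as follows. The interval counts, widths, and the positions 0, 127, 128, and 255 are taken from the text; the function names, the use of a sorted boundary list with `bisect`, and the choice to decode the two open end intervals to their finite boundary are assumptions made for this illustration.

```python
import bisect
import math

INVALID, MISSING = 0, 255   # reserved positions from the text


def build_edges(m, s):
    """Return the 253 finite interval boundaries in ascending order."""
    # Offsets from m in units of s/64: step 1 in the first sigma zone,
    # step 2 in the second, ... step 32 in the sixth, ending at 6*s.
    units = [u for k in range(6) for u in range(64 * k, 64 * (k + 1), 2 ** k)]
    units.append(6 * 64)
    step = s / 64
    lower = [m - u * step for u in reversed(units[1:])]
    upper = [m + u * step for u in units]
    return lower + upper


def encode(value, edges):
    """Value -> interval position (1..254), or a reserved position."""
    if value is None:
        return MISSING
    if isinstance(value, bool) or not isinstance(value, (int, float)) or math.isnan(value):
        return INVALID
    return 1 + bisect.bisect_right(edges, value)


def decode(position, edges):
    """Interval position -> interval midpoint (None for reserved positions)."""
    if position in (INVALID, MISSING):
        return None
    if position == 1:          # open lower end interval: assumption, use its boundary
        return edges[0]
    if position == 254:        # open upper end interval: assumption, use its boundary
        return edges[-1]
    return 0.5 * (edges[position - 2] + edges[position - 1])


edges = build_edges(m=0.5, s=0.25)
code = encode(0.5, edges)        # falls into [m, m + s/64[
```

Each encoded value fits into one byte, so a whole feature column could be held in a `bytes` or `bytearray` object.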

In this way, for at least approximately normally distributed data, a discretization with a mean discretization error of only about 0.005·s can be achieved with a storage width of 8 bits per value. For values in the interval [m – s, m + s] (68% of all values) the mean discretization error is s/256; for values in [m – 2s, m – s[ and ]m + s, m + 2s] (28% of all values) it is s/128; and for values in [m – 3s, m – 2s[ and ]m + 2s, m + 3s] (4% of all values) it is s/64. This mean discretization error of about 0.005·s is thus smaller by roughly a factor of 3 than the mean discretization error achievable with a corresponding equidistant discretization of the range [m – 7s, m + 7s] into 254 intervals.

The on-site client computer can be set up and programmed such that numerical features are first mapped to normalized features with mean m = 0.5 and standard deviation s = 0.25, so that nearly all values (96%) lie in the range between 0 and 1. The normalized numerical features are thus comparable with normalized Boolean or nominal features, whose values likewise lie in the range between 0 and 1.
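A minimal sketch of such a normalization, assuming a plain linear rescaling based on the population mean and standard deviation (the text does not prescribe a specific formula; the function name `normalize` is hypothetical):

```python
import statistics


def normalize(values, target_mean=0.5, target_std=0.25):
    """Linearly rescale values to the given mean and standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    if s == 0:
        raise ValueError("constant feature: standard deviation is zero")
    return [target_mean + target_std * (v - m) / s for v in values]


raw = [2, 4, 4, 4, 5, 5, 7, 9]       # mean 5, standard deviation 2
normalized = normalize(raw)
```

Values more than two standard deviations from the mean end up outside [0, 1], which is consistent with the statement that only nearly all values fall into that range.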

The mean discretization error for the normalized numerical features is then about 0.0013, which is sufficiently accurate for SOM networks.
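The error figures quoted for the 8-bit layout can be retraced numerically. The shares (68%, 28%, 4%) and per-zone errors are those given in the text; the estimate for the equidistant case assumes a mean error of one quarter of the interval width, which is an approximation introduced here.

```python
s = 1.0   # work in units of the standard deviation

# (share of values, mean discretization error in that zone)
zones = [(0.68, s / 256), (0.28, s / 128), (0.04, s / 64)]
mean_error = sum(share * err for share, err in zones)      # about 0.0055 * s

# Equidistant alternative: [m - 7s, m + 7s] split into 254 intervals
equidistant_width = 14 * s / 254
equidistant_error = equidistant_width / 4                  # assumption: width / 4
ratio = equidistant_error / mean_error

# After normalization to s = 0.25
normalized_error = mean_error * 0.25                       # about 0.0014
```

Under these assumptions the ratio comes out at roughly 2.5, in line with the "roughly a factor of 3" stated in the text, and the normalized error is close to the quoted value of about 0.0013.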

To determine the mean (m) and the standard deviation (s) of the feature to be discretized, the on-site client computer can be set up and programmed to read a small fraction of the records, from about 0.1% to about 10%, and to calculate from it a provisional mean (m^(vorl)) and a provisional spread (s^(vorl)) for each numerical feature in the data.

The on-site client computer can further be set up and programmed to read all records and to carry out, for all numerical features, a provisional discretization based on the provisional mean (m^(vorl)) and the provisional spread (s^(vorl)).

For this purpose, the on-site client computer can be set up and programmed to form
• 65532 equidistant intervals of width s^(vorl)/256 centered on the provisional mean (m^(vorl)),
• two open end intervals ]–∞, m^(vorl) – 32766/256·s^(vorl)[ and [m^(vorl) + 32766/256·s^(vorl), ∞[, and
• two interval indices representing 'value not present' and 'invalid numerical value',
and to record, for all intervals, the frequencies with which a value falls into the respective interval.
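The provisional 16-bit discretization with frequency logging might look like the following sketch. The interval counts and widths follow the text; the concrete assignment of the 65536 codes (0 for 'value not present', 65535 for 'invalid numerical value', 1 and 65534 for the open end intervals) and the function name are assumptions, since the text does not fix them.

```python
import math
from collections import Counter

N_FINITE = 65532                  # equidistant intervals from the text
MISSING16, INVALID16 = 0, 65535   # hypothetical code assignment
LOW_OPEN, HIGH_OPEN = 1, 65534    # hypothetical codes for the open end intervals


def pre_discretize(value, m_prov, s_prov):
    """Map a raw value to a 16-bit provisional interval code."""
    if value is None:
        return MISSING16
    if isinstance(value, bool) or not isinstance(value, (int, float)) or math.isnan(value):
        return INVALID16
    width = s_prov / 256
    lower = m_prov - (N_FINITE // 2) * width
    upper = m_prov + (N_FINITE // 2) * width
    if value < lower:
        return LOW_OPEN
    if value >= upper:
        return HIGH_OPEN
    k = min(int((value - lower) / width), N_FINITE - 1)
    return 2 + k


# Single pass over the data: encode and log the interval frequencies.
data = [0.0, 0.2, -0.5, None]
histogram = Counter(pre_discretize(v, m_prov=0.0, s_prov=256.0) for v in data)
```

With `m_prov = 0` and `s_prov = 256` the provisional interval width is exactly 1, which keeps the example easy to check by hand.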

This procedure can be used as a pre-discretization ahead of the actual discretization in order to save computing time. Its technical advantage is twofold. On the one hand, this provisional discretization needs only 16 bits (2 bytes) of storage per numerical value and thus reduces the data volume to about a quarter compared with a 'double' floating-point number. On the other hand, it can be produced in a single pass through the entire original data, even if the mean and other distribution parameters (standard deviation, skewness, distribution shape) are not yet known at the outset. Direct compression into an 8-bit format, by contrast, is possible only once the distribution parameters have been determined exactly, which normally requires a (time-consuming) separate read pass through the entire original data. This additional read pass for exactly determining the mean and standard deviation (and possibly further distribution parameters) can be avoided here, because the compression scheme is granular enough to first estimate or guess the mean and standard deviation roughly (or determine them approximately on a small data sample) and then to absorb the following effects:

• Shift of the mean (m ≠ m^(vorl)). The provisional discretization covers 128 standard deviations to the right and left of the provisional mean m^(vorl) and can therefore cope with considerable later shifts of the mean.
• Change of the spread (s ≠ s^(vorl)). The provisional discretization is so fine over the entire covered range of 256·s^(vorl) that the later final discretization can still be derived from it even with a marked deviation of the spread (s = 0.25·s^(vorl) ... 16·s^(vorl)).
• No normal distribution is present. The provisional discretization uses narrow, equidistant intervals and can finely reproduce accumulations of values at any point of the distribution.

The on-site client computer can further be set up and programmed to analyze the recorded frequencies of the interval occupancies and to derive from them the shape of the value distribution and the matching final discretization.

For this purpose, the on-site client computer can be set up and programmed, for an at least approximately uniform distribution, to form two open end intervals and, between them, 2ⁿ – 4 equidistant intervals. One of the provisional interval boundaries is chosen as the upper boundary of the lower end interval such that the lower end interval contains about 1/(2ⁿ – 2) of all valid values in total. The width of the 2ⁿ – 4 equidistant intervals is set to the smallest multiple of the provisional interval width for which the occupancy of the remaining upper end interval does not exceed 1/(2ⁿ – 2) of all valid values. Here, 1 < n ≤ 64.
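One possible reading of this rule is sketched below. It assumes that the logged frequencies are available as a list `counts` over equidistant provisional intervals starting at `lo` with width `width`, and that "about 1/(2ⁿ – 2)" means the provisional boundary whose cumulative count is closest to that share; these details and the function name are assumptions, not taken from the text.

```python
from itertools import accumulate


def uniform_discretization(counts, lo, width, n_bits):
    """Return (g1, final_width) for an approximately uniform distribution."""
    if n_bits < 3:
        raise ValueError("at least 3 bits are needed for intermediate intervals")
    total = sum(counts)
    target = total / (2 ** n_bits - 2)      # desired occupancy of each end interval
    n_mid = 2 ** n_bits - 4                 # number of equidistant intervals
    cum = [0] + list(accumulate(counts))    # cum[i] = values below boundary i

    # Upper boundary of the lower end interval: closest to the target occupancy.
    i1 = min(range(len(cum)), key=lambda i: abs(cum[i] - target))

    # Smallest multiple k of the provisional width that leaves no more than
    # the target occupancy in the remaining upper end interval.
    k = 1
    while True:
        top = i1 + k * n_mid
        above = total - cum[top] if top < len(cum) else 0
        if above <= target:
            break
        k += 1
    return lo + i1 * width, k * width


# 60 provisional intervals of width 1 with 10 values each, 3 bits per value
g1, final_width = uniform_discretization([10] * 60, lo=0.0, width=1.0, n_bits=3)
```

In this example the six final intervals are ]–∞, 10[, four intervals of width 10 up to 50, and [50, ∞[, each holding 100 of the 600 values.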

The on-site client computer can further be set up and programmed, for an at least approximately exponential distribution (density function d(x) = λ·e^(–λx)), to define two open end intervals and, between them, 2ⁿ – 4 intervals of decreasing width as follows. One of the provisional interval boundaries is chosen as the upper boundary (g₁) of the lower end interval such that the lower end interval contains about 1/(2ⁿ – 2) of all valid values in total. The interval boundary g_end is determined above which about 1/(2ⁿ – 2) of all valid values lie in total. From g₁ and g_end, λ is determined as λ := ln(2ⁿ – 2)/(g_end – g₁), and the desired width (b) of the first intermediate interval is determined as b := ln((2ⁿ – 3)/(2ⁿ – 2))/λ. With an exponential distribution, exactly 1/(2ⁿ – 2) of all valid values then lie in this interval. Here, 1 < n ≤ 64. For a storage width of 8 bits (1 byte) per value, the following numerical factors result: 252, 254, 1/254, ln 254, ln(253/254); for 9 bits: 508, 1/510, ln 510, and ln(509/510).

Furthermore, in the case of an at least approximately exponential distribution, the on-site client computer can be set up and programmed to set an existing provisional interval limit as the next interval limit (g₂) such that the absolute value of the next interval limit (g₂) minus the upper limit (g₁) of the lower end interval minus the desired width (b), i.e. |g₂ – g₁ – b|, becomes minimal, then to multiply the desired width (b) of the first intermediate interval by the factor e^(λb), and to compute the subsequent intervals in the same way (i.e. minimize |g₃ – g₂ – b|; multiply b by e^(λb); ...).

For other distributions, e.g. the Weibull distribution, logarithmic distribution, Poisson distribution, etc., comparable special procedures are to be applied to discretize the intervals.

For distributions that fit none of the individually treated distribution forms, an at least approximately normal distribution can be assumed. In this case the on-site client computer can be set up and programmed to set the provisional interval limit closest to the true mean as the center point (m), and to set the standard deviation (s) such that it comes as close as possible to the true standard deviation and s/64 is a multiple of the existing interval width. The existing provisional intervals are then combined into larger new intervals with graded widths of s/64, s/32, s/16, ... Storage widths other than 8 bits per value yield correspondingly different numerical factors.

The methods described so far, including a possible pre-discretization into a 16-bit format, make it possible to achieve a sufficiently accurate discretization that is adapted to the actual value distribution of each numerical feature and needs only 8 bits of storage per feature, or another storage width tailored to the analysis methods to be applied. The original data has to be read completely only once for this.

The on-site client computer can furthermore be set up and programmed to store Boolean features on 2 bits of storage in the following form: 0: first valid value, 1: second valid value, 2: 'value not present', 3: 'invalid value'. If a Boolean (= two-valued) feature has no missing or invalid values, even 1 bit of storage suffices (0: first valid value, 1: second valid value).
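A minimal C++ sketch of this 2-bit encoding is given below. The four code values come from the text; the function names and the layout of four codes per byte, lowest bits first, are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Codes taken from the text: two valid values plus two special states.
enum BoolCode : std::uint8_t {
    FIRST_VALUE = 0,
    SECOND_VALUE = 1,
    VALUE_MISSING = 2,
    VALUE_INVALID = 3
};

// Pack 2-bit codes, four per byte, lowest bits first (layout is an assumption).
std::vector<std::uint8_t> packBoolCodes(const std::vector<BoolCode>& codes) {
    std::vector<std::uint8_t> out((codes.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < codes.size(); ++i)
        out[i / 4] |= static_cast<std::uint8_t>(codes[i] << (2 * (i % 4)));
    return out;
}

// Read the i-th 2-bit code back out of the packed buffer.
BoolCode unpackBoolCode(const std::vector<std::uint8_t>& packed, std::size_t i) {
    return static_cast<BoolCode>((packed[i / 4] >> (2 * (i % 4))) & 0x3);
}
```

With this layout, five Boolean feature values occupy two bytes instead of five.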

The on-site client computer can furthermore be set up and programmed to split nominal (textual) features, for use in the SOM map analysis, into several Boolean or numerical 0/1 features, one for each valid nominal value of the nominal feature. In a given data record, at most one of these 0/1 features can have the value 1. The data can therefore be stored such that, instead of the nominal value of a feature, the position (index) of this value in the list of all valid values of this feature is stored. In addition, two index values can be maintained that represent 'no value present' and 'value does not occur in the list of valid values'.
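The relation between the stored index and the 0/1 features can be sketched in C++ as follows. The function name is hypothetical, and mapping the two special index values to an all-zero vector is an assumption; the text only states that at most one of the 0/1 features is 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand a stored value index into one 0/1 feature per valid nominal value.
// Indices >= validValueCount stand for the two special states
// ('no value present', 'value not in the list') and yield all zeros (assumption).
std::vector<std::uint8_t> expandToIndicators(std::size_t index, std::size_t validValueCount) {
    std::vector<std::uint8_t> indicators(validValueCount, 0);
    if (index < validValueCount) indicators[index] = 1;
    return indicators;
}
```

Only the index has to be stored and transmitted; the 0/1 features can be regenerated whenever the analysis needs them.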

The on-site client computer can furthermore be set up and programmed to combine, for nominal features with very many distinct values, the less frequent values into one or more groups. For the SOM method, for example, treating more than about 15 nominal values per feature individually is normally not useful. In this case, the 14 most frequent values can be treated as individual values and all remaining values combined into the group 'other'.

For this purpose, the on-site client computer can be set up and programmed to carry out the following steps:
All original data records are read and, for each nominal feature in the original data, each value occurring for the first time is stored in a 'dictionary', in which every occurring value is assigned an index number and its frequency of occurrence is recorded. Once a user-defined limit of distinct values, e.g. about 30, 1000, or 65534, has been entered in the dictionary, the insertion of new values stops, and all subsequently occurring values that are not in the dictionary are recorded under the penultimate index position, which represents 'other value', while the last index position represents 'no value present'.

The nominal values are replaced by their index number in the dictionary (8-bit or 16-bit integer). This already reduces the data volume considerably compared with the previously required storage of character strings.

The dictionary is sorted by descending frequency of occurrence. The 14 most frequent values are treated as separate values and are assigned the index numbers 0 to 13. All other values in the dictionary, as well as the previous index 'other value', are assigned the value 14. The previous index 'value not present' becomes index 15. In the provisionally compressed data, the provisional index numbers are replaced by the new index numbers. Each index number can then be stored in no more than 4 bits.
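The final re-indexing step can be sketched in C++ as follows. This is a simplified illustration, not the patent's implementation: it counts frequencies directly from the values and skips the provisional index stage and the user-defined dictionary limit. The function names, the use of an empty string for a missing value, and the alphabetical tie-breaking are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

constexpr std::uint8_t OTHER_INDEX = 14;    // all less frequent values
constexpr std::uint8_t MISSING_INDEX = 15;  // 'value not present'

// Build the final 4-bit code table: the 14 most frequent values get 0..13.
// An empty string stands for a missing value (assumption of this sketch).
std::map<std::string, std::uint8_t> buildNominalCodes(const std::vector<std::string>& column) {
    std::map<std::string, std::size_t> freq;
    for (const std::string& v : column)
        if (!v.empty()) ++freq[v];

    std::vector<std::pair<std::string, std::size_t>> sorted(freq.begin(), freq.end());
    // Descending frequency; the stable sort keeps alphabetical order among ties.
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });

    std::map<std::string, std::uint8_t> codes;
    for (std::size_t i = 0; i < sorted.size() && i < OTHER_INDEX; ++i)
        codes[sorted[i].first] = static_cast<std::uint8_t>(i);
    return codes;
}

// Encode one value with the table; rare or unknown values fall into 'other'.
std::uint8_t encodeNominal(const std::map<std::string, std::uint8_t>& codes, const std::string& v) {
    if (v.empty()) return MISSING_INDEX;
    const auto it = codes.find(v);
    return it == codes.end() ? OTHER_INDEX : it->second;
}
```

Every code fits into 4 bits, so two nominal feature values can share one byte.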

Overall, this method achieves a sufficiently accurate compression that is adapted to the actual value frequencies of each nominal feature and needs only 4 bits of storage per nominal feature. The original data had to be read completely only once for this.

If more than 14 individual values are to be considered, the new index can occupy 8 bits of storage instead of 4. Up to 254 different individual values can then be represented.

This procedure lends itself to parallelization on a multiprocessor computer or a computer network. The provisional compression can be parallelized over partitioned data. The parallel threads then exchange data in order to determine global statistics (mean, standard deviation) or value frequencies. Once this information has been communicated between the individual threads, the final compression is again carried out in parallel.
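The exchange of statistics between partitions can be illustrated with a small C++ sketch. The type and function names are hypothetical, and accumulating count, sum and sum of squares per partition is one possible choice that the text does not prescribe; for very large or badly scaled data a numerically more robust update scheme may be preferable.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Partial statistics that one worker collects on its data partition.
struct PartialStats {
    std::size_t count = 0;
    double sum = 0.0;
    double sumSquares = 0.0;

    void add(double x) { ++count; sum += x; sumSquares += x * x; }
};

// Combine the per-partition statistics into global ones.
PartialStats mergeStats(const std::vector<PartialStats>& parts) {
    PartialStats total;
    for (const PartialStats& p : parts) {
        total.count += p.count;
        total.sum += p.sum;
        total.sumSquares += p.sumSquares;
    }
    return total;
}

// Global mean; assumes count > 0.
double mean(const PartialStats& s) { return s.sum / static_cast<double>(s.count); }

// Population standard deviation; assumes count > 0.
double stddev(const PartialStats& s) {
    const double m = mean(s);
    return std::sqrt(s.sumSquares / static_cast<double>(s.count) - m * m);
}
```

Each worker only has to communicate three numbers per numerical feature, after which every worker can run the final compression on its own partition with the same global mean and standard deviation.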

The compression techniques described also have the effect of anonymizing the data. If, for example, a user of a data analysis server does not want to provide his data to the server unencrypted for reasons of data protection or confidentiality, he can carry out the data compression on his own computer (the on-site client computer). The compressed data (which contains only interval indices for numerical data and value indices for binary and nominal data) is transmitted to the data analysis server. The 'decompression information' remains on the on-site client computer: for numerical data, the means, standard deviations, distribution form, minimum, maximum and possibly further information used for the discretization; for binary and nominal data, the value 'dictionaries' that allow the actual value to be inferred from the value index. An unauthorized viewer of the compressed data therefore cannot make use of it. The analysis server creates analyses, evaluations, SOM models, etc. on the basis of the interval and value indices of the compressed data and sends these results back to the on-site client computer. The latter is set up and programmed to link the results with the decompression information, so that the results are available with the original information.

The automatic discretization of numerical values and the combination of rarely occurring nominal values into the group 'other' automatically ensure compliance with legal regulations and rules that prohibit evaluations and analyses when the result groups are so small that individual persons could be inferred from them.

The following example implementation contains the following restrictions/simplifications compared with the general concept:
In the compression of numerical features, the special discretizations for particular distribution forms were not implemented, only the basic method, which assumes an approximately normal distribution.

Binary and nominal features are not distinguished but are compressed as 'categorical' features to a storage size of 4 bits.

The example implementation is written in the programming language C++. The following style conventions were followed: variable names and function names begin with lowercase letters, types and classes with uppercase letters. Constants consist of uppercase letters only. Instance variables of classes have the prefix 'iv', or 'piv' if the variable is a pointer.

The implementation consists of an enumeration type and 4 main classes:
enum FieldType {CONTINUOUS, DISCR_NUMERIC, BINARY, NOMINAL} describes the feature types (floating-point number, integer, binary, nominal).

The class GaussianCompress performs the compression and decompression of a numerical feature (under the assumption that the values are approximately normally distributed).

The class DataDescription describes the training data: feature names, feature types, number of distinct valid values of the nominal and discrete numerical features, and means and standard deviations of the numerical features.

The DataDescription class contains an object of type GaussianCompress for each numerical feature. The class DataRecord contains the data of a single data record. It is able to read the binary compressed data format from an object of type DataPage and to decompress it, using the objects of type GaussianCompress that it finds in the DataDescription of the DataPage object.

The class DataPage contains a series of compressed data records, which it can read in and compress with the method appendDataRecord, and read out again, automatically decompressed, as an object of type DataRecord with the method retrieveNextDataRecord. Each class instance thus contains some or all of the training data in compressed form for training a SOM network. The method readRecordFromDataPage decompresses the data record stored starting at the memory address pData.

The decompressed numerical values are written to the array pivNumValues, the decompressed binary and nominal field values to the array pivCatValues.

The method appendDataRecord compresses the data record dataRecord, which is given as a character string in which the individual feature values are separated by the separator character ivDataDescr.getSeparator().

The class GaussianCompress stores the mean m and standard deviation s of a distribution of numerical feature values. In addition, the class can map any numerical value to one of 256 discrete intervals, i.e. to a discrete value between 0 and 255, and conversely return the numerical value of the interval center for a given interval index.

The discrete intervals are defined as a function of m and s as follows:

• 64 intervals of width s/64 in each of the ranges [m – s, m[ and [m, m + s[
• 32 intervals of width s/32 in each of the ranges [m – 2s, m – s[ and [m + s, m + 2s[
• 16 intervals of width s/16 in each of the ranges [m – 3s, m – 2s[ and [m + 2s, m + 3s[
• 8 intervals of width s/8 in each of the ranges [m – 4s, m – 3s[ and [m + 3s, m + 4s[
• 4 intervals of width s/4 in each of the ranges [m – 5s, m – 4s[ and [m + 4s, m + 5s[
• 2 intervals of width s/2 in each of the ranges [m – 6s, m – 5s[ and [m + 5s, m + 6s[
• 1 interval of infinite width for each of ]–∞, m – 6s[ and [m + 6s, ∞[.

The intervals are distributed symmetrically around the mean m. That is, the interval [m – s/64, m[ has the interval position 127, and [m, m + s/64[ has the position 128. The numerical value assigned to each interval is the interval center. This means that interval 127 is assigned the value m – s/128, and interval 128 the value m + s/128. Interval 1 is assigned the value m – 6.5s, interval 254 the value m + 6.5s. The interval positions 0 and 255 are reserved for invalid and missing (SQL NULL) values, respectively. The numerical value assigned to these positions is DBL_MAX.

class GaussianCompress

class DataDescription

class DataRecord

void DataRecord::readRecordFromDataPage(const DataDescription& descr, const unsigned char* const pData)

class DataPage

bool DataPage::appendDataRecord(const string& dataRecord)
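The bodies of the listings named above are not reproduced in this text. As a rough illustration of the interval layout described for GaussianCompress, the following C++ sketch maps a value to its interval position and back. It is a hypothetical stand-in, not the patent's listing: the struct name, the helper position and the arithmetic are assumptions derived only from the interval definitions and interval positions stated above.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the interval mapping described in the text.
struct GaussianQuantizerSketch {
    double m;  // mean
    double s;  // standard deviation, assumed > 0

    // Position of a distance u >= 0 (in standard deviations from the mean),
    // counted in intervals outward from the mean.
    static double position(double u) {
        if (u >= 6.0) return std::min(126.0 + (u - 6.0), 126.5);  // open end interval
        const int band = static_cast<int>(u);     // band j covers [j, j+1) deviations
        const int perBand = 64 >> band;           // 64, 32, 16, 8, 4, 2 intervals
        const int start = 128 - (128 >> band);    // intervals in all closer bands
        return start + (u - band) * perBand;
    }

    // Map a valid (non-NaN) value to an interval position in 1..254.
    std::uint8_t encode(double x) const {
        const double t = (x - m) / s;
        const double p = t >= 0.0 ? position(t) : -position(-t);
        return static_cast<std::uint8_t>(128 + static_cast<int>(std::floor(p)));
    }

    // Map an interval position back to the value assigned to it.
    double decode(std::uint8_t index) const {
        if (index == 0 || index == 255) return DBL_MAX;  // reserved positions
        const bool upper = index >= 128;
        const int offset = upper ? index - 128 : 127 - index;  // 0..126 from the mean
        double t = 6.5;                                        // open end interval
        if (offset < 126) {
            int band = 0;
            while (offset >= 128 - (128 >> (band + 1))) ++band;
            const double width = 1.0 / (64 >> band);
            t = band + (offset - (128 - (128 >> band)) + 0.5) * width;
        }
        return upper ? m + t * s : m - t * s;
    }
};
```

With m = 0 and s = 64, for example, the value –1 falls into position 127 and the value 0 into position 128, and decoding positions 127 and 128 returns –0.5 and 0.5, i.e. m – s/128 and m + s/128.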

These computer program objects are intended for execution in an electronic data processing system with at least one analysis server and at least one on-site client computer. The analysis server has one or more computer program objects to implement a self-adapting neuron network that is to be trained on a database having a plurality of data records with many features. The on-site client computer has one or more computer program objects to subject data supplied to it to data preprocessing and/or data compression, and to send the data from the on-site client computer to the analysis server. One or more computer program objects of the analysis server train the self-adapting neuron network with the received, preprocessed/compressed data by repeatedly presenting the data to the self-adapting neuron network. One or more computer program objects of the analysis server then perform an analysis in order to create a self-adapting neuron network model or another data mining analysis result. One or more computer program objects of the analysis server cause the self-adapting neuron network model or other data mining analysis result to be sent from the analysis server to the on-site client computer. One or more computer program objects of the on-site client computer subject the data of the self-adapting neuron network model or other data mining analysis result to decompression.

One or more computer program objects of the on-site client computer can adapt the type of data compression to the type or structure of the data.

One or more computer program objects of the analysis server can train the self-adapting neuron network with the received anonymized data until a converged network state results that adequately represents the data.

One or more computer program objects of the analysis server can present the data to the self-adapting neuron network about 100 to about 200 times.

One or more computer program objects of the on-site client computer can subject the data supplied to it, in a volume of about 10 gigabytes up to several terabytes, to the data preprocessing and the data compression.

One or more computer program objects of the on-site client computer can read the data supplied to it once during data preprocessing and transform the original features contained in it into purely numerical, normalized features.

One or more computer program objects of the on-site client computer can, during data compression, compress the normalized numerical feature values of the data supplied to it so that between two bits and about 8–10 bits of storage are needed per feature value.

One or more computer program objects of the analysis server can keep at least one sub-object of the compressed data in main memory for processing at any given time, while other sub-objects of the compressed data are kept as binary data objects on the mass storage of the analysis server, from where they are read into main memory for processing by block-wise read operations.

One or more computer program objects of the on-site client computer can determine the data compression method to be used depending on a compression error that is acceptable for the respective analysis task, the compression method used and the compression rate to be achieved being determined depending on the different feature types (Boolean, numerical, nominal (textual)).

One or more computer program objects of the on-site client computer can set the mean prediction error for numerical features between about 0.01 and about 0.1.

One or more computer program objects of the on-site client computer can discretize the normalized numerical feature values into a number of discrete intervals such that the mean discretization error (|value – discrete value|) is about 0.001.

One or more computer program objects of the on-premises client computer can map the discretized numerical value to 2ⁿ – 2 interval indices plus the states 'value not present' and 'invalid value', and store it in n bits of memory, where 1 < n ≤ 64.

One or more computer program objects of the on-premises client computer can define the interval division non-equidistantly as a function of the value distribution, wherein preferably the interval widths of the discrete subintervals are set particularly small in regions of high value density.

For numerical value distributions that follow the Gaussian (normal) distribution with a mean (m) and a standard deviation (s), one or more computer program objects of the on-premises client computer can define

• about 64 intervals of width s/64 in each of the ranges [m – s, m[ and [m, m + s[;
• about 32 intervals of width s/32 in each of the ranges [m – 2s, m – s[ and [m + s, m + 2s[;
• about 16 intervals of width s/16 in each of the ranges [m – 3s, m – 2s[ and [m + 2s, m + 3s[;
• about 8 intervals of width s/8 in each of the ranges [m – 4s, m – 3s[ and [m + 3s, m + 4s[;
• about 4 intervals of width s/4 in each of the ranges [m – 5s, m – 4s[ and [m + 4s, m + 5s[;
• about 2 intervals of width s/2 in each of the ranges [m – 6s, m – 5s[ and [m + 5s, m + 6s[; and
• about 1 interval of infinite width for each of ]–∞, m – 6s[ and [m + 6s, ∞[.

One or more computer program objects of the on-premises client computer can distribute the discrete intervals symmetrically about the mean m as a function of the mean (m) and the standard deviation (s), wherein preferably

- the interval [m – s/64, m[ has the interval position 127, and
- the interval [m, m + s/64[ has the position 128, and wherein preferably
- the numerical value assigned to each interval is the interval midpoint,

and the interval positions 0 and 255 are reserved for invalid and missing values, respectively.
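The interval layout and positions described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the application: the names `gaussian_boundaries` and `encode`, the use of `None` for a missing value, and the treatment of NaN as an invalid value are hypothetical choices made for the example. The sketch builds the 253 finite boundaries between m – 6s and m + 6s and maps a value to an interval position from 1 to 254, leaving 0 and 255 for invalid and missing values.

```python
import math
from bisect import bisect_right

INVALID, MISSING = 0, 255  # reserved positions named in the text

def gaussian_boundaries(m, s):
    """Finite interval boundaries from m - 6s to m + 6s, ascending."""
    # Offsets from m in units of s/64: 64 steps of 1, 32 of 2, ..., 2 of 32.
    units = [0]
    for count, width in [(64, 1), (32, 2), (16, 4), (8, 8), (4, 16), (2, 32)]:
        for _ in range(count):
            units.append(units[-1] + width)
    lower = [m - u * s / 64 for u in reversed(units[1:])]
    upper = [m + u * s / 64 for u in units]
    return lower + upper  # 126 + 127 = 253 boundaries

def encode(x, boundaries):
    """Map a value to its 8-bit interval position (assumed conventions)."""
    if x is None:
        return MISSING
    if not isinstance(x, (int, float)) or math.isnan(x):
        return INVALID
    # Position 1 is ]-inf, m - 6s[, position 254 is [m + 6s, inf[.
    return bisect_right(boundaries, x) + 1

bounds = gaussian_boundaries(0.5, 0.25)
print(encode(0.5, bounds), encode(0.499, bounds))
```

With 2 · (64 + 32 + 16 + 8 + 4 + 2 + 1) = 254 intervals plus the two reserved positions, every value fits into one byte, which corresponds to the case n = 8 of the 2ⁿ – 2 interval indices mentioned earlier.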

One or more computer program objects of the on-premises client computer can first map numerical features to normalized features with a mean of m = 0.5 and a standard deviation of s = 0.25.

To determine the mean (m) and the standard deviation (s) of the feature to be discretized, one or more computer program objects of the on-premises client computer can read a fraction of about 1 per mille to about 10% of the data records and calculate from it, for each numerical feature in the data, a preliminary mean (m^(vorl)) and a preliminary spread (s^(vorl)).
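As a rough sketch of these two steps, the following Python fragment estimates a preliminary mean and spread from a sample of the records and maps values to a feature with mean 0.5 and standard deviation 0.25. The function names, the use of random sampling (the text only speaks of reading a fraction of the records), and the choice of the population standard deviation as the "spread" are assumptions for illustration.

```python
import random
import statistics

def preliminary_stats(values, fraction=0.01, seed=0):
    """Estimate mean and spread from a random sample of the records."""
    rng = random.Random(seed)
    k = max(2, int(len(values) * fraction))
    sample = rng.sample(values, min(k, len(values)))
    return statistics.fmean(sample), statistics.pstdev(sample)

def normalize(x, mean, std):
    """Affine map so that the feature has mean 0.5 and std 0.25.

    Assumes std > 0; a constant feature would need separate handling.
    """
    return 0.5 + 0.25 * (x - mean) / std

data = [float(v) for v in range(1000)]
m_vorl, s_vorl = preliminary_stats(data, fraction=0.1)
normalized = [normalize(v, m_vorl, s_vorl) for v in data]
```

Because the statistics come from a sample, the normalized feature only approximately has mean 0.5 and standard deviation 0.25, which is why the text treats these values as preliminary.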

One or more computer program objects of the on-premises client computer can read all data records and perform a preliminary discretization for all numerical features based on the preliminary mean (m^(vorl)) and the preliminary spread (s^(vorl)).

One or more computer program objects of the on-premises client computer can define 65532 equidistant intervals of width s^(vorl)/256 centered on the preliminary mean (m^(vorl)), as well as two open end intervals ]–∞, m^(vorl) – 32766/256 · s^(vorl)[ and [m^(vorl) + 32766/256 · s^(vorl), ∞[ and two interval indices representing 'value not present' and 'invalid numeric value', and can record, for all intervals, the frequencies with which a value falls into the respective interval.
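The preliminary discretization can be pictured as a 16-bit histogram: 65532 equidistant intervals, two open end intervals, and two reserved indices give 65536 = 2¹⁶ index values. The sketch below is a minimal Python illustration. The numbering of the indices (0 for the lower end interval, 65533 for the upper one, 65534 and 65535 for the reserved states), the function names, and the handling of non-finite inputs as invalid are assumptions, since the text does not fix them.

```python
import math
from collections import Counter

N_EQUI = 65532                      # equidistant intervals
HALF = N_EQUI // 2                  # 32766 on each side of the preliminary mean
LOW_END, HIGH_END = 0, N_EQUI + 1   # open end intervals (assumed numbering)
MISSING, INVALID = 65534, 65535     # reserved indices (assumed placement)

def preliminary_index(x, m_vorl, s_vorl):
    """Index of the preliminary interval that contains x."""
    if x is None:
        return MISSING
    if not isinstance(x, (int, float)) or not math.isfinite(x):
        return INVALID
    # Signed interval number relative to m_vorl, interval width s_vorl / 256.
    k = math.floor((x - m_vorl) * 256 / s_vorl)
    return min(max(k + HALF + 1, LOW_END), HIGH_END)

def histogram(values, m_vorl, s_vorl):
    """Record how often a value falls into each preliminary interval."""
    return Counter(preliminary_index(v, m_vorl, s_vorl) for v in values)

print(histogram([0.5, 0.5, None], 0.5, 0.25))
```

The recorded frequencies are what a later step can inspect to decide which final interval layout suits the feature.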

One or more computer program objects of the on-premises client computer can analyze the logged frequencies of the interval occupancies and derive from them the shape of the value distribution and the matching final discretization.

For an at least approximately uniform distribution, one or more computer program objects of the on-premises client computer can form two open end intervals and, between them, 2ⁿ – 4 equidistant intervals, wherein one of the preliminary interval boundaries is set as the upper boundary of the lower end interval such that the lower end interval contains in total about 1/2ⁿ of all valid values, and wherein the width of the 2ⁿ – 4 equidistant intervals is set to the smallest multiple of the preliminary interval width for which the occupancy of the remaining upper end interval does not exceed 1/2ⁿ of all valid values, where 1 < n ≤ 64.
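One possible reading of this rule, expressed over a list of preliminary interval counts, is sketched below in Python. The function name, the representation of boundaries as list positions, and the tie-breaking rule "closest cumulative share" for the lower boundary are assumptions; the text only requires that the lower end interval hold about 1/2ⁿ of the valid values.

```python
from itertools import accumulate

def uniform_discretization(counts, n):
    """Return (lower, k) for an approximately uniform feature.

    counts: occupancies of the preliminary intervals, in ascending order.
    lower:  number of preliminary intervals merged into the lower end interval.
    k:      width of each final interval, as a multiple of the preliminary width.
    """
    total = sum(counts)
    target = total / 2 ** n
    cum = [0] + list(accumulate(counts))  # cum[i] = values below boundary i
    # Boundary whose lower tail is closest to the target share.
    lower = min(range(len(cum)), key=lambda i: abs(cum[i] - target))
    inner = 2 ** n - 4                    # number of equidistant intervals
    k = 1
    while True:
        upper = min(lower + k * inner, len(counts))
        if total - cum[upper] <= target:  # occupancy of the upper end interval
            return lower, k
        k += 1

print(uniform_discretization([1] * 64, 3))
```

For 64 equally occupied preliminary intervals and n = 3, the lower end interval takes 8 of them, the four inner intervals span 12 each, and 8 remain for the upper end interval.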

For an at least approximately exponential distribution (density function d(x) = λ·e^(–λx)), one or more computer program objects of the on-premises client computer can define two open end intervals and, between them, 2ⁿ – 4 intervals of decreasing width, such that one of the preliminary interval boundaries is set as the upper boundary (g₁) of the lower end interval so that the lower end interval contains in total about 1/(2ⁿ – 2) of all valid values, and the interval boundary g_end is determined above which in total about 1/(2ⁿ – 2) of all valid values lie, wherein λ is determined from g₁ and g_end as λ := ln(2ⁿ – 2)/(g_end – g₁), and the desired width (b) of the first intermediate interval is determined as b := ln((2ⁿ – 3)/(2ⁿ – 2))/λ, where 1 < n ≤ 64.

One or more computer program objects of the on-premises client computer can set an existing preliminary interval boundary as the next interval boundary (g₂) such that the magnitude |g₂ – g₁ – b|, i.e. the deviation of the interval width g₂ – g₁ from the desired width (b), becomes minimal, where g₁ is the upper boundary of the lower end interval; the desired width (b) of the first intermediate interval is then multiplied by the factor e^(λb), and the following intervals are calculated accordingly.

One or more computer program objects of the on-premises client computer can set the preliminary interval boundary closest to the true mean as the center (m), and set the standard deviation (s) such that it comes as close as possible to the true standard deviation and that s/64 is a multiple of the existing interval width, wherein the existing preliminary intervals are merged into larger new intervals that have a decreasing width distribution of s/64, s/32, s/16, ....

One or more computer program objects of the on-premises client computer can store Boolean features as distinct feature values ('first valid value', 'second valid value', 'value not present', 'invalid value').

One or more computer program objects of the on-premises client computer can split nominal features into several Boolean or numerical 0/1 features for use in the SOM map analysis.
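Splitting a nominal feature into 0/1 features is commonly called one-hot encoding. A minimal Python sketch follows; the function name and the example categories are hypothetical.

```python
def split_nominal(values, categories):
    """Turn one nominal feature into len(categories) separate 0/1 features."""
    return [[1 if v == c else 0 for c in categories] for v in values]

rows = split_nominal(["red", "blue", "red"], ["red", "green", "blue"])
print(rows)
```

Each nominal value becomes one row with a single 1 in the column of its category; a value outside the category list yields a row of zeros in this sketch.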

One or more computer program objects of the on-premises client computer can store nominal features as the position of the value in the list of all valid values of that feature, wherein two additional index values are maintained which represent 'no value present' and 'value does not occur in the list of valid values'.

One or more computer program objects of the on-premises client computer can index up to about 100, preferably up to about 60, different nominal values individually or as groups of individual nominal values.

About 10 to about 20 or 30, preferably about 15, of the most frequent values can be selected by one or more computer program objects of the on-premises client computer for evaluation as individual values, and all other values can be combined into a single value group under an index 'other'.

One or more computer program objects of the on-premises client computer can perform the following steps:

- Reading all original data records into the main memory of the on-premises client computer (12), wherein for each nominal feature in the original data a value occurring for the first time is stored in a 'dictionary' file, in which an index number is assigned to each occurring value and its frequency of occurrence is recorded;
- as soon as a user-defined limit of, e.g., about 30, 1000, or 65534 different values has been entered in the dictionary,
- stopping the insertion of new values, and
- entering all subsequently occurring values that do not appear in the dictionary under the penultimate index position, which represents 'other value', while the last index position represents 'no value present';
- replacing the nominal values in the dictionary with the index number as an 8-bit or 16-bit integer.
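The dictionary-building steps above can be sketched in Python as follows. The function name, the in-memory `dict` instead of a dictionary file, and the treatment of `None` and the empty string as 'no value present' are assumptions for illustration only.

```python
def build_dictionary(values, limit=30):
    """Index distinct values up to `limit` entries and count occurrences.

    Index `limit` stands for 'other value' (penultimate position) and
    index `limit + 1` for 'no value present' (last position).
    """
    index = {}                    # value -> index number
    freq = [0] * (limit + 2)      # occurrence counts per index
    other, missing = limit, limit + 1
    encoded = []
    for v in values:
        if v is None or v == "":
            i = missing
        elif v in index:
            i = index[v]
        elif len(index) < limit:
            i = index[v] = len(index)   # first occurrence: new entry
        else:
            i = other                   # dictionary is full: no new entries
        freq[i] += 1
        encoded.append(i)
    return index, freq, encoded

index, freq, encoded = build_dictionary(["a", "b", "a", None, "c", "d"], limit=2)
print(index, freq, encoded)
```

With a limit of 254 every index fits into an 8-bit integer, and with a limit of 65534 into a 16-bit integer, which matches the two integer sizes named in the last step.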

One or more computer program objects of the on-premises client computer can perform the following step:

- Sorting the dictionary by descending frequency of occurrence of the entries.

One or more computer program objects of the on-premises client computer can perform a compression adapted to the actual value frequencies of each nominal feature, which uses 4 bits or one byte of storage per nominal feature.

One or more computer program objects of the on-premises client computer can perform the following steps:

- Recording the 14 or 253 most frequent values as separate values;
- assigning the index numbers 0 to 13 or 0 to 253 to these values in the dictionary, preferably according to their frequency;
- assigning the index number 14 or 254 to all other values in the dictionary, including the previous index 'other value';
- assigning the index number 15 or 255 to the previous index 'value not present'; and
- storing each of the index numbers in 4 bits or 1 byte of storage.
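The recoding to 4-bit or 1-byte indices can be sketched as follows. The function names and the nibble-packing order are hypothetical. The sketch derives the number of individually indexed values from the index ranges (2ⁿ – 2, i.e. 14 for 4 bits and 254 for one byte); the text itself speaks of the 253 most frequent values in the byte case, so the count used here is an assumption.

```python
def recode(freq_by_value, bits=4):
    """Final indices: most frequent values first, then 'other', then 'missing'."""
    top = 2 ** bits - 2              # individually indexed values
    other, missing = top, top + 1    # 14 and 15 for 4 bits, 254 and 255 for 8
    ranked = sorted(freq_by_value, key=freq_by_value.get, reverse=True)
    mapping = {v: i for i, v in enumerate(ranked[:top])}
    return mapping, other, missing

def pack_nibbles(indices):
    """Pack 4-bit indices two per byte, high nibble first.

    An odd-length input is padded with 0, so the number of records
    has to be stored separately.
    """
    padded = list(indices) + [0] * (len(indices) % 2)
    return bytes((a << 4) | b for a, b in zip(padded[::2], padded[1::2]))

mapping, other, missing = recode({"a": 5, "b": 9, "c": 1}, bits=4)
packed = pack_nibbles([mapping["b"], mapping["a"], missing])
print(mapping, other, missing, packed.hex())
```

Values that are not in `mapping` would be stored under the `other` index, and absent values under `missing`.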

One or more computer program objects of the on-premises client computer can run a preliminary compression in parallel on partitioned data using several processor cores, communicate global statistics or value frequencies between the individual threads, and run the final compression in parallel.

One or more computer program objects of the on-premises client computer can anonymize the data in that the data compression is performed on the on-premises client computer, the compressed data, containing the interval indices for numerical data and the value indices for binary and nominal data, are transmitted to the data analysis server (10), and the 'decompression information' (for numerical data: means, standard deviations, distribution shape, minimum, maximum; for binary and nominal data: the value "dictionaries"), which allows the actual value to be inferred from the value index, is stored on the on-premises client computer (12).

One or more computer program objects of the analysis server can create analyses, evaluations, and SOM models on the basis of the interval and value indices of the compressed data and send these results back to the on-premises client computer (12).

One or more computer program objects of the on-premises client computer can link the results with the decompression information, so that the results are available together with the original information.

A first computer program product can contain one or more computer program objects for carrying out one or more of the aforementioned method steps on an on-premises client computer.

A second computer program product can contain one or more computer program objects for carrying out one or more of the aforementioned method steps on an analysis server.

CITATIONS INCLUDED IN THE DESCRIPTION

This list of the documents cited by the applicant was generated automatically and is included solely for the better information of the reader. The list is not part of the German patent or utility model application. The DPMA assumes no liability for any errors or omissions.

Cited patent literature

- EP 97115654 [0005]
- EP 97120787 [0005]

Cited non-patent literature

- T. Kohonen, Self-Organization and Associative Memory, vol. 8 of Springer Series in Information Sciences, 3rd edition, Springer-Verlag, Berlin, 1989 [0002]
- Ballard et al., Dynamic Warehousing: Data Mining Made Easy, IBM Redbook, 2007 [0003]
- R. Otte, V. Otte, V. Kaiser, Data Mining für die industrielle Praxis, Hanser Verlag, Munich, 2004 [0004]

Claims

Electronic data processing system for the analysis of data, comprising - at least one analysis server (10) and - at least one on-premises client computer (12), wherein - the analysis server (10) is set up and programmed to implement a self-adapting neural network that is to be trained on a large database with a plurality of data records with many features, wherein - the on-premises client computer (12) is set up and programmed to subject data supplied to it to - data preprocessing and/or - data compression before the data are sent from the on-premises client computer (12) via an electronic network (14) to the analysis server (10), wherein - the analysis server (10) is set up and programmed to train the self-adapting neural network with the received preprocessed/compressed data by repeatedly presenting the data to the self-adapting neural network and then performing an analysis in order to create a self-adapting neural network model or another data mining analysis result, and wherein - the analysis server (10) is set up and programmed to cause the self-adapting neural network model or other data mining analysis result to be sent from the analysis server (10) to the on-premises client computer (12), and - the on-premises client computer (12) is set up and programmed to decompress the data of the self-adapting neural network model or other data mining analysis result.

An electronic data processing system according to claim 1, wherein the on-premises client computer (12) is set up and programmed to adapt the type of data compression to the type or structure of the data.

An electronic data processing system according to claim 1, wherein the on-premises client computer (12) is set up and programmed to subject data supplied to it, in a volume of about 10 gigabytes up to several terabytes, to the data preprocessing and data compression.

An electronic data processing system according to claim 1, wherein the on-premises client computer (12) is set up and programmed to read the data supplied to it once during the data preprocessing and to transform the original features contained therein into purely numerical, normalized features.

An electronic data processing system according to claim 1, wherein the on-premises client computer (12) is set up and programmed to compress the normalized numerical feature values of the data supplied to it during data compression such that between two bits and about 8 to 10 bits of storage space are required per feature value.

An electronic data processing system according to claim 1, wherein the on-premises client computer (12) is set up and programmed to select the data compression method to be used depending on a compression error that is acceptable for the respective analysis task, wherein the compression method used and the compression rate to be achieved are set depending on the different feature types (Boolean, numerical, nominal (textual)).

An electronic data processing system according to claim 1, wherein the on-premises client computer (12) is set up and programmed to set the mean prediction error for numerical features to between about 0.01 and about 0.1.

An electronic data processing system according to claim 1, wherein the on-premises client computer (12) is set up and programmed to discretize the normalized numerical feature values into a number of discrete intervals such that the mean discretization error (|value – discrete value|) is about 0.001.

An electronic data processing system according to claim 1, wherein the on-premises client computer (12) is set up and programmed to map the discretized numerical value to 2ⁿ – 2 interval indices plus the states 'value not present' and 'invalid value', and to store it in n bits of memory, where 1 < n ≤ 64.

An electronic data processing system according to claim 9, wherein the on-premises client computer (12) is set up and programmed to define the interval division non-equidistantly as a function of the value distribution, wherein preferably the interval widths of the discrete subintervals are set particularly small in regions of high value density.

An electronic data processing system according to claim 9 or 10, wherein the on-premises client computer (12) is set up and programmed to define, for numerical value distributions that follow the Gaussian (normal) distribution with a mean (m) and a standard deviation (s),
• about 64 intervals of width s/64 in each of the ranges [m – s, m[ and [m, m + s[;
• about 32 intervals of width s/32 in each of the ranges [m – 2s, m – s[ and [m + s, m + 2s[;
• about 16 intervals of width s/16 in each of the ranges [m – 3s, m – 2s[ and [m + 2s, m + 3s[;
• about 8 intervals of width s/8 in each of the ranges [m – 4s, m – 3s[ and [m + 3s, m + 4s[;
• about 4 intervals of width s/4 in each of the ranges [m – 5s, m – 4s[ and [m + 4s, m + 5s[;
• about 2 intervals of width s/2 in each of the ranges [m – 6s, m – 5s[ and [m + 5s, m + 6s[; and
• about 1 interval of infinite width for each of ]–∞, m – 6s[ and [m + 6s, ∞[.

An electronic data processing system according to any one of claims 9 to 11, wherein the on-premises client computer (12) is set up and programmed to distribute the discrete intervals symmetrically about the mean m as a function of the mean (m) and the standard deviation (s), wherein preferably - the interval [m – s/64, m[ has the interval position 127, and - the interval [m, m + s/64[ has the position 128, and wherein preferably - the numerical value assigned to each interval is the interval midpoint, and the interval positions 0 and 255 are reserved for invalid and missing values, respectively.

An electronic data processing system according to any one of claims 9 to 12, wherein the on-premises client computer (12) is set up and programmed to first map numerical features to normalized features with a mean of m = 0.5 and a standard deviation of s = 0.25.

An electronic data processing system according to any one of claims 9 to 13, wherein the on-premises client computer (12) is set up and programmed to read, for determining the mean (m) and the standard deviation (s) of the feature to be discretized, a fraction of about 1 per mille to about 10% of the data records and to calculate from it, for each numerical feature in the data, a preliminary mean (m^(vorl)) and a preliminary spread (s^(vorl)).

An electronic data processing system according to any one of claims 9 to 14, wherein the on-premises client computer (12) is set up and programmed to read all data records and to perform a preliminary discretization for all numerical features based on the preliminary mean (m^(vorl)) and the preliminary spread (s^(vorl)).

An electronic data processing system according to any of claims 9 to 15, wherein the on-premises client computer (12) is set up and programmed to define 65532 equidistant intervals of width s^(vorl)/256 centered on the preliminary mean (m^(vorl)), as well as two open end intervals ]–∞, m^(vorl) – 32766/256 · s^(vorl)[ and [m^(vorl) + 32766/256 · s^(vorl), ∞[ and two interval indices representing 'value not present' and 'invalid numeric value', and to record, for all intervals, the frequencies with which a value falls into the respective interval.

An electronic data processing system according to claim 16, wherein the on-premises client computer (12) is set up and programmed to analyze the logged frequencies of the interval occupancies and to derive from them the shape of the value distribution and the matching final discretization.

An electronic data processing system according to any of claims 9 to 17, wherein the on-premises client computer ( 12 ) is set up and programmed to form, with at least approximate equal distribution, two open end intervals and between 2 ⁿ - 4 equidistant intervals, in which one of the provisional interval limits is set as upper limit of the lower end interval such that in the lower end total are about 1/2 ^{n of} all valid values and the width of the 2 ⁿ -4 equidistant intervals is the smallest multiple of the provisional interval width, which increases the occupation of the remaining upper end interval to not more than 1/2 ^{n of} all valid values where 1 <n ≤ 64.

An electronic data processing system according to any of claims 9 to 17, wherein the on-premises client computer ( 12 ) is set up and programmed, with at least approximate exponential distribution (density function d (x) = λ · e ^-λx ), to ^define two open end intervals and between 2 ⁿ -4 intervals of decreasing width such that the upper limit (g ₁ _{) of} the lower end interval one of the provisional interval limits is set so that in the lower end interval a total of about 1/2 ⁿ - 2 of all valid values and the interval limit g _end , is determined to be above the total of about 1/2 ⁿ - 2 of all valid values, where λ of g ₁ and g _end bestimt becomes λ: = In (2 ⁿ - 2) / (g _end - g _1), and the desired width (b) of the first intermediate interval as a b: = In ((2 ⁿ - 3) / (2 ⁿ - 2)) / λ, where 1 <n ≤ 64.

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is set up and programmed to set an existing provisional interval limit as the next interval limit (g ₂ ) such that the magnitude of the upper limit difference (g ₁ ) of the lower end interval minus the next interval limit (g ₂ ) minus desired width (b) (| g ₂ - g ₁ - b |) becomes minimal, multiply the ^{desired width} (b) of the first intermediate interval by the factor e ^λb and calculate the next intervals accordingly.

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is set up and programmed to set the provisional interval limit closest to the true mean (m) as the center point (m), and to choose the standard deviation (s) as close as possible to the true standard deviation such that s/64 is a multiple of the existing interval width, the existing provisional intervals being merged into larger new intervals having a decreasing width distribution of s/64, s/32, s/16, ....

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is set up and programmed to store Boolean characteristics as different characteristic values ('first valid value', 'second valid value', 'value not available', 'invalid value').

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is set up and programmed to split nominal features for use in SOM map analysis into a plurality of Boolean and numeric 0/1 features, respectively.

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is set up and programmed to store nominal features as the position of the feature in the list of all valid values of that feature and to keep two index values representing 'no value present' and 'value does not appear in the list of valid values'.

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is configured and programmed to index up to about 100, preferably up to about 60, different nominal values individually or as groups of individual nominal values.

Electronic data processing system according to the preceding claim, wherein about 10 to about 20 or 30, preferably about 15, of the most frequent values are selected for evaluation as individual values, and all other values are combined into a single value group under an index 'others'.

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is set up and programmed to carry out the following steps: - reading all the original data sets into the memory of the on-premises client computer ( 12 ) and, for each nominal feature in the original data, storing each newly occurring value in a dictionary file in which each occurring value is assigned an index number and its occurrence frequency is recorded; - as soon as a user-defined limit, e.g. 30, 1000, or 65534, of different values has been entered in the dictionary, ending the insertion of new values, and entering all subsequently occurring values that do not occur in the dictionary under the penultimate index position, which represents 'other value', while the last index position represents 'no value present'; - replacing the nominal values with their index number in the dictionary, stored as an 8-bit or 16-bit integer.

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is set up and programmed to perform the following step: - Sorting the dictionary according to descending occurrence frequency of the entries.

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is adapted and programmed to perform a compression adapted to the actual nominal value frequencies of each nominal feature using 4 bits or one byte of storage space per nominal feature.

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is set up and programmed to carry out the following steps: - determining the 14 or 253 most frequent values as separate values; - assigning the index numbers 0 to 13 or 0 to 253 to these values in the dictionary, preferably according to their frequency; - assigning the index number 14 or 254 to all other values in the dictionary, including the previous index 'other value'; - assigning the index number 15 or 255 to the previous index 'value not available'; and - storing each of the index numbers in 4 bits or 1 byte of memory, respectively.
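The 4-bit variant of the steps above can be illustrated with a minimal Python sketch. The function names (`build_dictionary`, `encode`, `pack_nibbles`), the use of `None` for a missing value, and the choice to pack two codes per byte with the high nibble first are assumptions made for illustration only; the claim itself fixes only the index ranges 0 to 13, 14 and 15.

```python
from collections import Counter

OTHER, MISSING = 14, 15  # reserved 4-bit index numbers named in the claim


def build_dictionary(values):
    """Map the 14 most frequent values to 0..13, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    return {value: i for i, (value, _) in enumerate(counts.most_common(14))}


def encode(values, dictionary):
    """One 4-bit code per value; None stands for 'value not available' (assumption)."""
    return [MISSING if v is None else dictionary.get(v, OTHER) for v in values]


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, high nibble first, padding with MISSING."""
    if len(codes) % 2:
        codes = codes + [MISSING]
    return bytes((hi << 4) | lo for hi, lo in zip(codes[::2], codes[1::2]))
```

Values outside the 14 most frequent ones all collapse to index 14, so the compressed column cannot be decoded without the dictionary kept on the client side.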

An electronic data processing system according to the preceding claim, wherein the on-premises client computer ( 12 ) is set up and programmed to carry out a preliminary compression with multiple computer cores on partitioned data in parallel, to communicate global statistics or value frequencies between individual threads and to execute the final compression in parallel.

Electronic data processing system according to one of the preceding claims, wherein the on-site client computer ( 12 ) is configured and programmed to perform anonymization of the data in that the data compression is carried out on the on-premises client computer, the compressed data with the numeric interval indices and the binary and nominal value indices are sent to the data analysis server, and the decompression information, namely for numerical data the mean values, standard deviations, distribution form, minimum and maximum, and for binary and nominal data the value "dictionaries" that allow the actual value to be inferred from the value index, is stored on the on-site client computer, and wherein the analysis server creates analyses, evaluations and SOM models based on the interval and value indices of the compressed data and sends those results back to the on-premises client computer, which is set up and programmed to link the results with the decompression information, whereby the results are available with the original information.

Method for analyzing data in an electronic data processing system comprising at least one analysis server ( 10 ) and at least one on-premises client computer ( 12 ), wherein the analysis server ( 10 ) comprises one or more computer program objects for implementing a self-adapting neuron network to be trained on a database having a plurality of records with a plurality of features, the on-premises client computer ( 12 ) has one or more computer program objects that subject data supplied to it to data preprocessing and/or data compression before the data are sent from the on-premises client computer ( 12 ) to the analysis server ( 10 ), one or more computer program objects of the analysis server ( 10 ) train the self-adapting neuron network with the received, preprocessed/compressed data by repeatedly presenting the data to the self-adapting neuron network, one or more computer program objects of the analysis server ( 10 ) then perform an analysis to create a self-adapting neuron network model or other data mining analysis result, one or more computer program objects of the analysis server ( 10 ) cause the self-adapting neuron network model or other data mining analysis result to be sent from the analysis server ( 10 ) to the on-premises client computer ( 12 ), and one or more computer program objects of the on-premises client computer ( 12 ) decompress the data of the self-adapting neuron network model or other data mining analysis result.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) adapt the type of data compression to the type or structure of the data.

Method according to Claim 33, in which one or more computer program objects of the analysis server ( 10 ) train the self-adapting neuron network with the received anonymized data as many times as necessary until a fully converged network state is reached that adequately represents the data.

Method according to Claim 33, in which one or more computer program objects of the analysis server ( 10 ) present the data to the self-adapting neuron network about 100 to about 200 times.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) subject the data supplied to it, in an amount of about 10 gigabytes up to several terabytes, to data preprocessing and data compression.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) read the data supplied to it once during data preprocessing and transform the original features contained therein into purely numerical, normalized features.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) compress the normalized numerical feature values of the data supplied to it during data compression so that between two bits and about 8 to 10 bits of storage space are required per feature value.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) determine a data compression method to be used in dependence on a compression error acceptable for the respective analysis task, the compression method to be used and the compression rate to be achieved being determined depending on the different feature types (Boolean, numeric, nominal (textual)).

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) set the mean prediction error for numerical features between about 0.01 and about 0.1.

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) perform a discretization of the normalized numerical feature values into a number of discrete intervals such that a mean discretization error (| value - discrete value |) is about 0.001.

The method of claim 42, wherein one or more computer program objects of the on-premises client computer ( 12 ) map the discretized numerical value to one of 2ⁿ - 2 interval indices or to one of the two indices 'value not present' and 'invalid value', and store it in n bits of memory, where 1 < n ≤ 64.

A method according to claim 42 or 43, wherein one or more computer program objects of the on-premises client computer ( 12 ) set the interval division non-equidistantly as a function of the value distribution, wherein preferably the interval widths of the discrete subintervals in regions of high value density are chosen to be particularly small.

A method according to claim 42, 43 or 44, wherein one or more computer program objects of the on-premises client computer ( 12 ), for numerical data whose value distribution follows the Gaussian or normal distribution with a mean value (m) and a standard deviation (s), establish: • about 64 intervals of width s/64 in each of the ranges [m - s, m[ and [m, m + s[; • about 32 intervals of width s/32 in each of the ranges [m - 2s, m - s[ and [m + s, m + 2s[; • about 16 intervals of width s/16 in each of the ranges [m - 3s, m - 2s[ and [m + 2s, m + 3s[; • about 8 intervals of width s/8 in each of the ranges [m - 4s, m - 3s[ and [m + 3s, m + 4s[; • about 4 intervals of width s/4 in each of the ranges [m - 5s, m - 4s[ and [m + 4s, m + 5s[; • about 2 intervals of width s/2 in each of the ranges [m - 6s, m - 5s[ and [m + 5s, m + 6s[; and • about 1 interval of infinite width for each of ]-∞, m - 6s[ and [m + 6s, ∞[.

Method according to one of Claims 40 to 45, in which one or more computer program objects of the on-site client computer ( 12 ) distribute the discrete intervals as a function of the mean (m) and the standard deviation (s) symmetrically about the mean m, where preferably - the interval [m - s/64, m[ has the interval position 127, and - the interval [m, m + s/64[ has the interval position 128, and where preferably - the numerical value assigned to each interval is the interval midpoint, and - the interval positions 0 and 255 are reserved for invalid and missing values, respectively.
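The 8-bit layout described in the two preceding claims can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `gaussian_edges` and `encode` are hypothetical, `None` is taken to mean a missing value and NaN an invalid one, and positions 1 and 254 are assumed to be the two open end intervals. The band widths, the positions 127 and 128, and the reserved positions 0 and 255 come from the claims.

```python
import math
from bisect import bisect_right

INVALID, MISSING = 0, 255  # reserved interval positions named in the claim


def gaussian_edges(m, s):
    """Return the 253 finite interval limits, ascending, for mean m and spread s."""
    offsets = []
    for band, count in enumerate((64, 32, 16, 8, 4, 2)):
        width = s / count  # s/64 in the first band, then s/32, ..., s/2
        offsets += [band * s + j * width for j in range(1, count + 1)]
    return [m - o for o in reversed(offsets)] + [m] + [m + o for o in offsets]


def encode(x, edges):
    """Map a numeric value to an interval position 1..254 (half-open intervals)."""
    if x is None:
        return MISSING
    if math.isnan(x):
        return INVALID
    return 1 + bisect_right(edges, x)
```

Counting the intervals gives 2 · (64 + 32 + 16 + 8 + 4 + 2 + 1) = 254 value positions, which together with the two reserved positions fills one byte exactly.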

The method of claim 33, wherein one or more computer program objects of the on-premises client computer ( 12 ) first map numerical features to normalized features with mean m = 0.5 and a standard deviation of s = 0.25.
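A minimal Python sketch of such a normalization is shown below. The function name `normalize`, the use of the population standard deviation, and the handling of a constant feature are assumptions for illustration; the claim specifies only the target mean m = 0.5 and target standard deviation s = 0.25.

```python
from statistics import fmean, pstdev


def normalize(values, target_mean=0.5, target_sd=0.25):
    """Linearly rescale values so their mean is 0.5 and their spread is 0.25."""
    m = fmean(values)
    s = pstdev(values)  # population standard deviation (assumption)
    if s == 0:
        # Constant feature: every value maps to the target mean (assumption).
        return [target_mean] * len(values)
    return [target_mean + target_sd * (v - m) / s for v in values]
```

With this choice, values within two standard deviations of the mean land in the range 0 to 1.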

Method according to one of Claims 40 to 47, in which one or more computer program objects of the on-site client computer ( 12 ), in order to determine the mean value (m) and the standard deviation (s) of the feature to be discretized, read a fraction of about 1 per thousand to about 10% of the records and calculate from it, for each numerical feature in the data, a provisional mean (m^(vorl)) and a provisional spread (s^(vorl)).

Method according to one of Claims 40 to 48, in which one or more computer program objects of the on-site client computer ( 12 ) read all data sets and carry out a provisional discretization for all numerical features based on the provisional mean (m^(vorl)) and the provisional spread (s^(vorl)).

Method according to one of Claims 40 to 49, in which one or more computer program objects of the on-site client computer ( 12 ) form 65532 equidistant intervals of width s^(vorl)/256 centered around the provisional mean (m^(vorl)), two open end intervals ]-∞, m^(vorl) - 32766/256 · s^(vorl)[ and [m^(vorl) + 32766/256 · s^(vorl), ∞[, and two interval indices that indicate 'value not present' and 'invalid numerical value', and record for all intervals the frequencies with which a value falls within the respective interval.
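The provisional 16-bit discretization can be sketched in Python as below. The concrete index layout (0 for the lower end interval, 1 to 65532 for the equidistant intervals, 65533 for the upper end interval, 65534 and 65535 for the two special indices) is an assumption, as are the names `provisional_index` and `provisional_histogram` and the treatment of `None` as 'value not present'. The claim fixes only the interval counts and widths.

```python
import math
from collections import Counter

N_INNER = 65532  # equidistant intervals of width s_prov / 256
LOWER_END, UPPER_END = 0, N_INNER + 1          # assumed index layout
MISSING, INVALID = N_INNER + 2, N_INNER + 3    # 65534 and 65535


def provisional_index(x, m_prov, s_prov):
    """Return the provisional 16-bit interval index of one value."""
    if x is None:
        return MISSING
    if not isinstance(x, (int, float)) or math.isnan(x):
        return INVALID
    width = s_prov / 256
    lo = m_prov - (N_INNER // 2) * width
    hi = m_prov + (N_INNER // 2) * width
    if x < lo:
        return LOWER_END
    if x >= hi:
        return UPPER_END
    # Clamp guards against floating-point rounding at the upper limit.
    return 1 + min(math.floor((x - lo) / width), N_INNER - 1)


def provisional_histogram(values, m_prov, s_prov):
    """Count how many values fall into each provisional interval."""
    return Counter(provisional_index(x, m_prov, s_prov) for x in values)
```

The resulting histogram is the input from which the value distribution form and the final discretization are derived in the subsequent claims.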

The method of claim 50, wherein one or more computer program objects of the on-premises client computer ( 12 ) analyze the frequencies of the interval occupancies and derive therefrom the form of the value distribution and the corresponding final discretization.

Method according to one of Claims 40 to 51, in which one or more computer program objects of the on-site client computer ( 12 ), for an at least approximately uniform distribution, form two open end intervals and, between them, 2ⁿ - 4 equidistant intervals, wherein one of the provisional interval limits is set as the upper limit of the lower end interval such that the lower end interval contains a total of about 1/2ⁿ of all valid values, and the width of the 2ⁿ - 4 equidistant intervals is set to the smallest multiple of the provisional interval width that lets the occupancy of the remaining upper end interval rise to no more than 1/2ⁿ of all valid values, where 1 < n ≤ 64.
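One possible reading of this selection rule is sketched below in Python. It assumes that `counts` holds the occupancies of the provisional equidistant intervals only (without the provisional end intervals), that "about 1/2ⁿ" means the provisional limit whose cumulative count is closest to that share, and that n ≥ 3 so that at least one intermediate interval exists. The function name `uniform_final_intervals` is hypothetical.

```python
from itertools import accumulate


def uniform_final_intervals(counts, n):
    """Return (g1, k): index of the lower limit and width multiple k.

    g1 is the provisional limit used as upper limit of the lower end interval;
    each of the 2**n - 4 final intervals spans k provisional intervals.
    """
    if n < 3:
        raise ValueError("this sketch needs at least one intermediate interval")
    total = sum(counts)
    target = total / 2**n
    cum = [0] + list(accumulate(counts))  # cum[i] = values below limit i
    g1 = min(range(len(cum)), key=lambda i: abs(cum[i] - target))
    inner = 2**n - 4
    k = 1
    while True:
        upper_start = g1 + inner * k
        above = total - cum[min(upper_start, len(counts))]
        if above <= target:  # upper end interval holds at most 1/2**n
            return g1, k
        k += 1
```

For example, with 64 provisional intervals of 10 values each and n = 3, the target share is 80 values, so the lower limit falls after 8 provisional intervals and each of the 4 final intervals spans 12 provisional intervals.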

Method according to one of claims 40 to 52, in which one or more computer program objects of the on-site client computer ( 12 ), for an at least approximately exponential distribution (density function d(x) = λ · e^(-λx)), define two open end intervals and, between them, 2ⁿ - 4 intervals of decreasing width, such that one of the provisional interval limits is set as the upper limit (g₁) of the lower end interval so that the lower end interval contains a total of about 1/(2ⁿ - 2) of all valid values, and the interval limit g_end is determined such that a total of about 1/(2ⁿ - 2) of all valid values lie above it, where λ is determined from g₁ and g_end as λ := ln(2ⁿ - 2) / (g_end - g₁), and the desired width (b) of the first intermediate interval as b := ln((2ⁿ - 3) / (2ⁿ - 2)) / λ, where 1 < n ≤ 64.

Method according to the preceding claim, in which one or more computer program objects of the on-site client computer ( 12 ) define an existing provisional interval limit as the next interval limit (g₂) such that the magnitude |g₂ - g₁ - b|, i.e. the difference between the next interval limit (g₂) and the upper limit (g₁) of the lower end interval, less the desired width (b), is minimized, multiply the desired width (b) of the first intermediate interval by the factor e^(λb), and calculate the subsequent intervals accordingly.

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) set the provisional interval limit closest to the true mean (m) as the center point (m), and choose the standard deviation (s) as close as possible to the true standard deviation such that s/64 is a multiple of the existing interval width, the existing provisional intervals being merged into larger new intervals having a decreasing width distribution of s/64, s/32, s/16, ....

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) store Boolean characteristics as different characteristic values ('first valid value', 'second valid value', 'value not available', 'invalid value').

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) split nominal features into multiple Boolean or numeric 0/1 features for use in SOM map analysis.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) store nominal features as the position of the feature in the list of all valid values of that feature and keep two index values representing 'no value present' and 'value does not appear in the list of valid values'.

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) index up to about 100, preferably up to about 60, different nominal values individually or as groups of individual nominal values.

A method according to the preceding claim, wherein about 10 to about 20 or 30, preferably about 15, of the most frequent values are selected for evaluation as individual values, and all other values are combined into a single value group under an index 'others'.

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) perform the following steps: - reading all original data records into the memory of the on-premises client computer ( 12 ) and, for each nominal feature in the original data, storing each newly occurring value in a dictionary file in which each occurring value is assigned an index number and its occurrence frequency is recorded; - as soon as a user-defined limit, e.g. 30, 1000, or 65534, of different values has been entered in the dictionary, ending the insertion of new values, and entering all subsequently occurring values that do not occur in the dictionary under the penultimate index position, which represents 'other value', while the last index position represents 'no value present'; - replacing the nominal values with their index number in the dictionary, stored as an 8-bit or 16-bit integer.
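The dictionary-building steps above can be sketched in Python as a single pass over one nominal feature. The function name `build_capped_dictionary`, the in-memory `dict` in place of a dictionary file, and the use of `None` for a missing value are assumptions for illustration.

```python
def build_capped_dictionary(values, limit=30):
    """Index one nominal feature, capping the dictionary at `limit` entries.

    Returns (index, freq, codes): value -> index number, index number ->
    occurrence frequency, and the encoded column.
    """
    other, missing = limit, limit + 1  # penultimate and last index position
    index, freq, codes = {}, {}, []
    for v in values:
        if v is None:
            code = missing            # 'no value present'
        elif v in index:
            code = index[v]
        elif len(index) < limit:
            code = index[v] = len(index)  # first occurrence gets the next number
        else:
            code = other              # dictionary is full: 'other value'
        freq[code] = freq.get(code, 0) + 1
        codes.append(code)
    return index, freq, codes
```

With `limit=65534` the two reserved positions become 65534 and 65535, so every code fits a 16-bit integer; with `limit=254` every code fits 8 bits.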

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) perform the following step: - sorting the dictionary according to descending occurrence frequency of the entries.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) make a compression adapted to the actual value frequencies of each nominal feature using 4 bits or one byte of memory per nominal feature.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) carry out the following steps: - determining the 14 or 253 most frequent values as separate values; - assigning the index numbers 0 to 13 or 0 to 253 to these values in the dictionary, preferably according to their frequency; - assigning the index number 14 or 254 to all other values in the dictionary, including the previous index 'other value'; - assigning the index number 15 or 255 to the previous index 'value not available'; and - storing each of the index numbers in 4 bits or 1 byte of memory, respectively.

Method according to the preceding claim, wherein one or more computer program objects of the on-site client computer ( 12 ) perform a preliminary compression with multiple computer cores in parallel on partitioned data, communicate global statistics between individual threads, and perform the final compression in parallel.

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) perform anonymization of the data by carrying out the data compression on the on-premises client computer ( 12 ), sending the compressed data with the numeric interval indices and the binary and nominal value indices to the data analysis server ( 10 ), and storing the decompression information, namely for numerical data the mean values, standard deviations, distribution form, minimum and maximum, and for binary and nominal data the value "dictionaries" that allow the actual value to be inferred from the value index, on the on-site client computer ( 12 ).

Method according to one of the preceding method claims, wherein one or more computer program objects of the analysis server ( 10 ) create analyses, evaluations and SOM models based on the interval and value indices of the compressed data and send these results back to the on-premises client computer ( 12 ).

Method according to one of the preceding method claims, wherein one or more computer program objects of the on-site client computer ( 12 ) link the results to the decompression information, providing the results with the original information.

Computer program product containing one or more computer program objects for executing one or more of the aforementioned method steps on an on-site client computer ( 12 ).

Computer program product containing one or more computer program objects for executing one or more of the aforementioned method steps on an analysis server ( 10 ).