
Data Factory
A data factory is used to systematically generate, process and make available large volumes of necessary data for the various use cases of the digitalized rail system. One prominent use case is the training of systems that work with artificial intelligence (AI).
In 2022, Digitale Schiene Deutschland began establishing the "Data Factory." This initiative enables the systematic generation, management, and provision of large volumes of essential data for various use cases within the digitized railway system.
On the one hand, digital systems generate vast amounts of data — for example, sensors like cameras mounted on trains capture detailed information about the rail environment. On the other hand, digitalization also requires the systematic creation of data — for instance, to sufficiently train artificial intelligence (AI). A well-known use case is training AI software for sensor-based perception systems. This is especially needed in the context of creating a digital representation of railway infrastructure or for fully automated, driverless train operation (known as Grade of Automation 4, or GoA4).
To develop such AI software — for example, for environmental perception — large volumes of real sensor data and simulated data of very high quality are required.
Organizing this data for training AI-based functions via a unified cloud and IT platform presents a major challenge. The platform enables various stakeholders (operators, manufacturers, AI specialists, etc.) to access and utilize this data.
The Data Factory is illustrated in the image below, showing the data flow from the train to the trackside and all the way to the backend.

The Path of the Data
Trains can be equipped with sensors such as cameras, infrared cameras, LiDARs, radars, and localization sensors.
During operation, these sensors capture the railway environment, including infrastructure elements and other objects.
Due to the variety of sensor types, the resulting data is referred to as multimodal (sensor) data.

The train is equipped with the Vehicle Data Logger (1), which records the multimodal sensor data and ensures data integrity. The Vehicle Data Logger is rail-certified and features a large storage capacity (120 TB) as well as Wi-Fi technology for wireless data transmission.

Once the train is parked, the collected data is wirelessly transmitted to the Data Touchpoint (2). The Data Touchpoint is a trackside edge-cloud solution equipped with storage and computing capacity. Within the Data Touchpoint, the data is reduced, pre-processed, and then transmitted via Wi-Fi to the Data Center (3).
The DB-operated Data Center (on-premises) includes a large-scale storage infrastructure (>5 PB) as well as several Nvidia DGX and Nvidia OVX servers for demanding machine learning, simulation, and analytics tasks. A Microsoft Azure-based Hyper Converged Infrastructure (HCI) serves as the computing platform, while additional cloud integrations — such as the use of AWS S3 object storage — complete this highly scalable infrastructure solution.
These IT assets form the foundation for the software toolchain, which provides coordinated applications and platforms for managing and processing the data.

Included within this system are the Data Pipelines (4). These are containerized applications that run on the Vehicle Data Logger, the Data Touchpoint, and within the Data Center. Some of their components include functions for data recording, data quality assurance, and data transformation, as well as capabilities for data export and import.
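As a rough illustration of how recording, quality assurance, and export steps might fit together, the following minimal Python sketch attaches a checksum to each recorded chunk and forwards only chunks that still match it. The names `RecordedChunk`, `record`, `is_intact`, and `export_intact`, and the choice of SHA-256, are assumptions made for this example and do not describe the actual Data Pipelines.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedChunk:
    """Hypothetical chunk of sensor data with the checksum taken at recording time."""
    sensor: str
    payload: bytes
    sha256: str


def record(sensor: str, payload: bytes) -> RecordedChunk:
    """Recording step: store the payload together with its SHA-256 digest."""
    return RecordedChunk(sensor, payload, hashlib.sha256(payload).hexdigest())


def is_intact(chunk: RecordedChunk) -> bool:
    """Quality-assurance step: recompute the digest and compare it to the stored one."""
    return hashlib.sha256(chunk.payload).hexdigest() == chunk.sha256


def export_intact(chunks: list[RecordedChunk]) -> list[RecordedChunk]:
    """Export step: forward only the chunks that pass the integrity check."""
    return [chunk for chunk in chunks if is_intact(chunk)]


good = record("camera_front", b"frame-0001")
# Simulate corruption: the payload no longer matches the stored checksum.
corrupted = RecordedChunk("lidar_front", b"tampered", good.sha256)

exported = export_intact([good, corrupted])
print([chunk.sensor for chunk in exported])  # ['camera_front']
```

Because each step is a small function on the same record type, the same check could in principle run on the vehicle, at the trackside, and in the data center.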

The required data volumes can be generated both by recording real sensor data from the track environment and by creating synthetic data using various simulation environments.

The data from the real railway environment, recorded with the Vehicle Data Logger and processed through the Data Pipelines, must subsequently be annotated. This means that areas within images containing the objects to be learned are marked. These marked areas are referred to as annotations.
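An annotation of this kind can be pictured as a labelled rectangle in image coordinates. The sketch below is only an illustration: the class `BoxAnnotation` and its fields are hypothetical and do not reflect the annotation format actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxAnnotation:
    """Hypothetical annotation: a labelled rectangle in pixel coordinates."""
    label: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int

    def area(self) -> int:
        """Number of pixels covered by the marked area."""
        return self.width * self.height

    def contains(self, px: int, py: int) -> bool:
        """True if the pixel (px, py) lies inside the marked area."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


person = BoxAnnotation("person", x=100, y=50, width=40, height=120)
print(person.area())              # 4800
print(person.contains(120, 100))  # True
```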

The Data Platform (5) is the central hub where all sensor data, analysis results, and annotations converge. It enables structured data management, provides powerful search capabilities, and supports the visualization of multimodal data. In addition, it serves as the central data interface for all functions of the software toolchain and supplies the required data to both external customers and internal DB stakeholders.

Data Analytics (6) brings together various topics related to extracting information from data. A central task is developing, training, and evaluating machine learning-based AI functions. These AI functions perform automated data analysis and detect objects such as people, vehicles, or infrastructure elements like overhead line masts, PZB magnets, manholes, and cable ducts.

All sensor data, annotations, detections, and analysis results are consolidated into datasets (7). These datasets can be made available to both internal and external stakeholders. In addition, selected portions of the data are published as open datasets in collaboration with the German Centre for Rail Transport Research (Deutsches Zentrum für Schienenverkehrsforschung DZSF).

- Fully automated train operation requires trains equipped with front sensors and AI functions. The development of AI functions for driverless driving requires enormous amounts of sensor data. The efficient capture and storage of this data is handled by the Data Factory. A key project for the development of this application is the AutomatedTrain project, which tests automatic equipping and de-equipping as well as fully automated provisioning and stabling of trains.
- Collecting sensor data on railway tracks is complex and requires consideration of numerous technical and regulatory aspects. Additionally, test routes and vehicles are scarce and expensive. The Data Factory has the expertise to generate such data under these demanding conditions. These constraints clearly distinguish the railway sector from the automotive sector, where car manufacturers can collect data on public roads with comparatively little effort.
- To support research and industrial development, the Data Factory is committed to ensuring non-discriminatory access to this data.
- The Data Factory's joint activities with the German Centre for Rail Transport Research (DZSF) at the Federal Railway Authority (EBA) are particularly notable. As part of the joint project, OSDaR23, the first freely available annotated multimodal sensor dataset, was published. A more extensive dataset, OSDaR25, will follow soon.
- The renovation of railway infrastructure in Germany is essential to ensure reliable services and increase capacity for rising passenger traffic. The digital inventory of the infrastructure is crucial. Through uniform recording and assessment of the condition of tracks and trains, targeted and efficient renovation can be achieved.
- Sensor data from tracks and, in particular, the AI analysis tools of the Data Factory could support (predictive) maintenance. Automated recognition of overhead line masts, PZB magnets, manholes, and cable ducts has already been implemented as a prototype and can be usefully expanded. There is great potential in equipping measurement and inspection trains, which traverse the entire German rail network several times a year and could capture it as a digital twin.
To date, there are hardly any public data sets from the rail sector. As part of the "Digitale Schiene Deutschland" sector initiative, DB InfraGO AG has therefore created and published OSDaR23, the first publicly available multi-sensor data set, together with the German Centre for Rail Transport Research (DZSF) at the German Federal Railway Authority (EBA).

The data set consists of time-synchronized sensor data from:
- 3 high-resolution cameras, 3 medium-resolution cameras, 3 infrared cameras
- 3 long-range LiDARs, 1 medium-range LiDAR, 2 short-range LiDARs
- 1 long-range radar, 4 inertial measurement units, 4 GPS/GNSS sensors
The data set contains annotations for 20 object classes, uses the ASAM OpenLABEL annotation format, and is available for download.
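To make the sensor setup listed above easier to check, the following short Python sketch tallies the sensors per modality. The counts are taken from the list; the dictionary layout and the names `SENSOR_SETUP` and `count_by_modality` are assumptions chosen for this example and are not part of the data set.

```python
# Sensor counts as listed for the data set; the grouping keys are illustrative.
SENSOR_SETUP = {
    "camera": {"high_resolution": 3, "medium_resolution": 3, "infrared": 3},
    "lidar": {"long_range": 3, "medium_range": 1, "short_range": 2},
    "radar": {"long_range": 1},
    "imu": {"inertial_measurement_unit": 4},
    "gnss": {"gps_gnss": 4},
}


def count_by_modality(setup: dict[str, dict[str, int]]) -> dict[str, int]:
    """Sum the sensor counts within each modality."""
    return {modality: sum(kinds.values()) for modality, kinds in setup.items()}


per_modality = count_by_modality(SENSOR_SETUP)
total = sum(per_modality.values())

print(per_modality)  # {'camera': 9, 'lidar': 6, 'radar': 1, 'imu': 4, 'gnss': 4}
print(total)         # 24
```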
Further Information:
https://digitale-schiene-deutschland.de/Downloads/ETR-OSDaR23.pdf
https://digitale-schiene-deutschland.de/en/news/2022/Data-Factory
In fully automated driving, sensor data is likely to be evaluated by AI models developed with machine learning (ML) algorithms on suitable training, validation and test data. The basis for ML is data sets consisting of input and output data, from which an ML algorithm can "learn". For object recognition in automatic train operation (ATO), the input data is, for example, sensor data that records the relevant areas, such as the route in front of the train. The objects to be recognized (e.g. people, tracks) must also be captured in this data. The output data comprises everything that the ML model under development is to derive from the input data. This includes, for example, the location coordinates of the areas in which the objects to be recognized are located, classifications of the objects, and the values of object attributes.
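The pairing of input and output data, and the division into training, validation and test data, can be sketched as follows. This is a minimal illustration under stated assumptions: the classes `ObjectLabel` and `Sample`, the function `assign_split`, and the 80/10/10 split are hypothetical and are not taken from the project.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ObjectLabel:
    """Output data for one object: where it is, what it is, and its attributes."""
    classification: str
    box: tuple[int, int, int, int]  # x, y, width, height in pixels
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Sample:
    """One learning example: sensor data as input, object labels as output."""
    sample_id: str
    sensor_frame: bytes
    labels: list[ObjectLabel]


def assign_split(sample_id: str, train_pct: int = 80, validation_pct: int = 10) -> str:
    """Assign a sample to a split deterministically from a hash of its ID."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + validation_pct:
        return "validation"
    return "test"


sample = Sample(
    sample_id="run-001/frame-0042",
    sensor_frame=b"...",  # placeholder bytes, not real sensor data
    labels=[ObjectLabel("person", (100, 50, 40, 120), {"occluded": "no"})],
)
print(assign_split(sample.sample_id))
```

Hashing the sample ID instead of drawing random numbers keeps a sample in the same split across runs, which helps keep test data out of training.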

Due to the large amount of data required for the development, testing and approval of the ATO functions, at least semi-automatic pre-analysis makes sense. Certain objects, situations or environmental conditions can then be identified automatically in the recorded data. This makes it possible to find specific data, such as recordings of wild animals or of objects located at particularly relevant distances from the track or in particular zones around it. These can then be annotated and used to train the AI functions. Special weather conditions such as rain or driving snow can also be recognized automatically and added to the data as a machine-generated description. In future, the overall assessment of the situation, i.e. how relevant the data is for AI training or testing, will also be automated.
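Such a pre-analysis can be thought of as attaching machine-generated tags to recordings and then searching by tag. The following Python sketch illustrates the idea only: the class `Recording`, the functions `pre_analyse` and `find`, the tag names, and the 10 m threshold are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Recording:
    """Hypothetical summary of one recording after automatic detection."""
    recording_id: str
    detected_labels: set[str]
    weather: str
    min_object_distance_m: Optional[float]  # None if no object was detected
    tags: set[str] = field(default_factory=set)


def pre_analyse(rec: Recording, near_track_m: float = 10.0) -> Recording:
    """Attach machine-generated tags describing weather, wildlife and proximity."""
    if rec.weather in {"rain", "snow"}:
        rec.tags.add(f"weather:{rec.weather}")
    if "animal" in rec.detected_labels:
        rec.tags.add("wildlife")
    if rec.min_object_distance_m is not None and rec.min_object_distance_m <= near_track_m:
        rec.tags.add("object_near_track")
    return rec


def find(recordings: list[Recording], tag: str) -> list[str]:
    """Return the IDs of all recordings that carry the given tag."""
    return [rec.recording_id for rec in recordings if tag in rec.tags]


recordings = [pre_analyse(rec) for rec in [
    Recording("run-001", {"person"}, "clear", 4.5),
    Recording("run-002", {"animal"}, "snow", 35.0),
    Recording("run-003", set(), "rain", None),
]]

print(find(recordings, "wildlife"))           # ['run-002']
print(find(recordings, "object_near_track"))  # ['run-001']
print(find(recordings, "weather:rain"))       # ['run-003']
```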
Setting up and operating a data factory for a fully digitalized rail system is a major task. There is therefore a consensus in the rail sector that individual railroad companies or manufacturers will not be able to provide enough sensor data in the future to be able to train the numerous AI functions sufficiently. The European rail sector is therefore considering the creation of a "Pan-European Railway Data Factory" with a shared infrastructure that will enable rail companies and manufacturers across Europe to collect, process and simulate sensor data and make it available for mutual use.
The implementation strategy for the Pan-European Railway Data Factory (PEDF) is divided into short, medium and long-term measures. In the short term, the focus is on technical and legal solutions for the individual national data factories. In the medium term, the aim is to align standards so that the data factories of individual members can be integrated step by step. In the long term, the aim is comprehensive coordination of standardization efforts, particularly with regard to data quality, formats, interfaces and interconnectivity.
The participation paths for PEDF members include interface coordination, which keeps data exchange flexible, and toolchain coordination, which harmonizes the tools used. The strategy is pragmatic and incremental, with the goal of making PEDF a versatile and effective pan-European initiative.

Digitale Schiene Deutschland therefore helped launch the "Rail Data Factory" project as part of the "CEF2 Digital" funding program and, together with the French railway operator SNCF and the Dutch railway operator NS, conducted a study co-funded by the European Health and Digital Executive Agency (HaDEA). The aim was to assess the feasibility of a Pan-European Railway Data Factory from a technical, economic, regulatory and operational perspective. The study started in January 2023 and was completed in December 2023. A Rail Advisory Board and close synchronization with data factory-related activities in the Europe's Rail funding project "R2DATO" ensured that the study took the needs of the rail sector into account and was carried out in line with comparable activities.
In addition to the development of the architecture and an implementation plan, a key result was the confirmation that the establishment of a Pan-European Railway Data Factory is highly relevant for the project participants.
Further Information:
https://digitale-schiene-deutschland.de/en/news/2023/Pan-European-Railway-Data-Factory
The ERJU "R2DATO" project (ERJU = Europe's Rail Joint Undertaking) aims to develop a joint innovation roadmap for rail operators and manufacturers for future Europe-wide digital and automated rail transport and to develop and test the necessary technological enablers for this.
On the one hand, aspects of the requirements for the Data Factory will be developed and shared with the project members. On the other hand, the Data Factory prototype of Digitale Schiene Deutschland will be built in parallel. The requirements focus on the assets in the data center and the future tool chain as well as on data quality and annotation.
The tool chain includes a data platform that handles data management and visualization. It also includes tools for annotating (sensor) data and a simulation platform that synthesizes artificial data (see section 2). The training and evaluation of AI functions is carried out in the machine learning platform, which is also part of the tool chain. The Testing & Certification Platform is intended to support the future approval of AI functions and an Access & Information Platform ensures the seamless interaction of the individual tools.
Building on the results of the CEF2 study (section 6), the concept of the Pan-European Data Factory is being further pursued and developed here. The aim is to merge independent data factories and IT assets using a high-speed network, define common interfaces and create a standardized toolchain.
The standardized toolchain is intended to ensure data sovereignty and enable non-discriminatory access to data for stakeholders. In addition, it should create synergies in data collection, data processing and AI development and enable the approval of AI functions.
A legal opinion will clarify which areas of law are relevant in the R2DATO WP7 project. This expert opinion will thus form the basis for the further development of the concept of a consortium-led pan-European data factory.
Another component of the work package is the sector-wide coordination of data simulation and data annotation. In cooperation with the project partners, the first step is to simulate non-regular scenarios and annotate sensor data.
The digital map (digital Register WP27) will provide exemplary map data.
Finally, an open data set will be created containing real sensor data, annotations, simulated sensor data and map data.
Specialized articles
- Pan-European Railway Data Factory – infrastructure and ecosystem for fully automated rail operations | April 2024
  Many European railways are striving toward automated rail operations. This requires the collection of extensive sensor data for AI training. A Pan-European Railway Data Factory (PEDF), as a joint infrastructure and partner ecosystem, is seen as a suitable way forward for the sector. This article summarises the highlights and findings of the CEF2 RailDataFactory study undertaken by Deutsche Bahn AG (DB), Société nationale des chemins de fer français (SNCF) and Nederlandse Spoorwegen N.V. (NS).
  Source: SIGNAL+DRAHT
- Study result "D1 - Data Factory Concept, Use Cases and Requirements" | June 2023
  The "pan-European Railway Data Factory" is a kind of ecosystem with a shared infrastructure that enables rail companies and manufacturers to collect, process and simulate sensor data across Europe and make it available for mutual use.
- Open multisensor data set for the development of environment perception in fully automated driving | April 2023 (only in German)
  Machine learning methods will also be used for environment perception in automated driving in railroad operations. However, the data sets required for their development are currently hardly available to the public. Such a multi-sensor data set was created and published in a project by the DZSF and DB Netz AG as part of the Digitale Schiene Deutschland sector initiative.
  Source: Eisenbahntechnische Rundschau