“Enable researchers to focus more on the science and less on the plumbing” is the core message of this great article by Jeff de La Beaujardière, in which he summarizes the upcoming challenges and opportunities of Earth Observation (EO) data and describes the path to a generic EO data processing solution – by introducing the concept of a Geodata Fabric.
We at EOX have been following these ideas and concepts for a while, both for our own large-scale processing needs with our cloudless mosaic products and in our project work, such as Euro Data Cube (EDC), an initiative supported by the European Space Agency (ESA) to create a concrete representation of such a Geodata Fabric – we call it the EO Information Factory – fostering the long-term vision of building up a Digital Twin Earth. The ESA EO Φ-Week virtual event provides an opportunity to acknowledge the latest achievements in that direction, with EDC being presented and showcased in two side-events.
If you are following the trends in EO data processing, you are probably familiar with the great blog post series by Chris Holmes on Cloud Native Geospatial and Analysis Ready Data – setting the stage for why EO data should be cloud friendly and analysis ready.
Cloud-friendly data enables analysis straight on the source data files by relying on structured metadata for partial data reads covering the spatial and temporal extents of interest – without requiring the download of the whole file. Several such cloud-optimized formats have gained popularity, with Cloud Optimized GeoTIFF (COG) and Zarr being the most prominent ones, both on their way to becoming agreed standards. Relying on them for persistent storage allows lazy loading of data, materializing only those chunks which are really needed for processing.
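To illustrate the idea, here is a minimal sketch of the chunk-index arithmetic behind such partial reads – a hypothetical helper, not taken from any particular library. Given a chunked layout like Zarr's, only the chunks intersecting the requested window need to be fetched:

```python
def chunks_for_window(chunk_size, row_range, col_range):
    """Return the (row, col) indices of all square chunks that
    intersect the requested pixel window -- only these chunks
    need to be fetched from object storage."""
    r0, r1 = row_range  # half-open pixel range [r0, r1)
    c0, c1 = col_range  # half-open pixel range [c0, c1)
    rows = range(r0 // chunk_size, (r1 - 1) // chunk_size + 1)
    cols = range(c0 // chunk_size, (c1 - 1) // chunk_size + 1)
    return [(r, c) for r in rows for c in cols]

# A 512x512 window in the top-left corner of an image stored in
# 256x256 chunks touches only 4 chunks, however large the image is:
needed = chunks_for_window(256, (0, 512), (0, 512))
# -> [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The same arithmetic is what lets a reader translate a chunk index into a byte range (or object key) and request just that slice, instead of downloading the whole file.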
Analysis ready datasets (ARD) are defined as having all necessary transformations and aggregations applied and the required metadata included, making them directly consumable for domain-specific processing – from temporal analysis to statistical forecasts to AI and Machine Learning (ML) use cases.
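As a toy illustration of such an aggregation step – a hypothetical helper, not taken from any specific library – assume a per-pixel time series where cloudy observations have already been masked out as `None`; a temporal median composite then reduces it to one analysis-ready value:

```python
import statistics

def temporal_median(pixel_series):
    """Aggregate a per-pixel time series into a single composite
    value by taking the median of the valid (non-None) observations."""
    observations = [v for v in pixel_series if v is not None]
    return statistics.median(observations) if observations else None

# Reflectance values over time for one pixel; None marks cloudy scenes.
# The median suppresses both the clouds and the 0.55 outlier:
series = [0.12, None, 0.10, 0.55, None, 0.11]
composite = temporal_median(series)
```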
If ARDs are stored in a cloud-friendly format, the only missing piece is an elastic and scalable compute layer – a workspace environment with all the infrastructure resources to run processes in a cost-effective manner, enabling reusable and reproducible workflow execution, both interactively and headless (for on-demand, systematic, and scheduled invocation).
We at EOX operate such cloud workspaces, providing different flavors of computational resources and storage options in various clouds, to run customer workloads close to the EO data archives they need, like the Amazon Public Data Sets (PDS) or the Mundi DIAS data offerings. Please reach out to us for details. Note: our AWS EU-Central-1 cloud workspace offering, the EDC EOxHub Workspace, is accessible in self-service mode!
But even if the needed data is not yet in a cloud-optimized format, or is still at a lower processing level (i.e. not analysis ready), you are not on your own with the data preparation steps.
One option is to use managed service offerings like EDC Sentinel Hub Batch Processing or the EDC xcube Generator, both of which connect to a data archive, transform the loaded data based on a supplied processing script, and stage the result into object storage – in a cloud-friendly and (now hopefully) analysis-ready way, to be used for subsequent domain-specific processing as laid out before.
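The underlying pattern of such batch services can be sketched as follows – a purely illustrative skeleton, with the hypothetical callables `load`, `process`, and `stage` standing in for the archive connection, the user-supplied processing script, and the object storage upload:

```python
def run_batch(load, process, stage, tiles):
    """Illustrative batch pattern: for every tile, load data from the
    archive, apply the user-supplied processing script, and stage the
    result into object storage."""
    staged = []
    for tile in tiles:
        data = load(tile)            # connect to the data archive
        product = process(data)      # supplied processing script
        staged.append(stage(tile, product))  # write to object storage
    return staged

# Usage with in-memory stand-ins for the archive and the bucket:
archive = {"t1": [1, 2], "t2": [3, 4]}
bucket = {}

def stage(tile, product):
    bucket[tile] = product           # pretend upload
    return tile

keys = run_batch(archive.__getitem__, lambda d: [v * 2 for v in d],
                 stage, ["t1", "t2"])
```

In the managed services, only the `process` part is supplied by the user; the loading and staging plumbing is exactly what the service takes off your hands.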
Another possibility is to leverage cloud workspaces, like the above-mentioned EDC EOxHub Workspace, for pre-processing as well. There are plenty of great EO libraries for accessing and transforming EO data, many of them freely and publicly available, with a curated selection bundled within the individual cloud workspace offerings and ready for use.
We will host a science user testimonial at ESA EO Φ-Week demonstrating such a data preparation step based on a concrete example. So if you are interested not only in how Henrik Fisser could detect trucks on roads across Europe using free 10 m imagery, but also in how he used the xcube-sh library (a Python package bundled within EDC) to shuffle 38 TB of imagery data with tens of millions of API calls from the EDC Sentinel Hub API to his cloud workspace before applying the truck detection algorithm, you should definitely check out this Φ-Week side-event.