Raising the bar for Earth Observation data processing

2020-09-10

“Enable researchers to focus more on the science and less on the plumbing” is the core message of this great article from Jeff de La Beaujardière, in which he summarizes the upcoming challenges and opportunities of Earth Observation (EO) data and describes the path to a generic EO data processing solution by introducing the concept of a Geodata Fabric.

We at EOX have been following these ideas and concepts for a while, both for our own large-scale processing needs with our cloudless mosaic products and in our project work, for example for Euro Data Cube (EDC), an initiative supported by the European Space Agency (ESA) to create a concrete representation of such a Geodata Fabric - we called it EO Information Factory - to foster the long-term vision of building up a Digital Twin Earth. The ESA EO Φ-Week virtual event provides an opportunity to acknowledge the latest achievements in that direction, with EDC being presented and showcased in two side-events.

If you follow the trends in EO data processing, you are probably familiar with the great blog post series by Chris Holmes on Cloud Native Geospatial and Analysis Ready Data, which set the stage for why EO data should be cloud friendly and analysis ready.

Cloud friendly data enables analysis straight on the source data files. Structured metadata makes partial reads possible, so that only the data within the spatial and temporal extents of interest is fetched, without downloading the whole file. Several such cloud-optimized formats have gained popularity, with Cloud Optimized GeoTIFF (COG) and Zarr being the most prominent ones, both on the way to becoming agreed standards. Relying on them for persistent storage allows lazy loading of data, materializing only those chunks of data which are really needed for processing.
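To make the idea of partial reads more tangible, here is a minimal sketch in plain Python of how a reader could work out which chunks of a chunked array intersect a requested window. It is not the COG or Zarr implementation; the function names, the cube dimensions and the chunk size are all assumptions chosen for illustration.

```python
from itertools import product
from math import prod


def chunks_for_window(chunk_shape, window):
    """Return the grid indices of all chunks that intersect a window.

    chunk_shape: chunk size per dimension, e.g. (1, 512, 512)
    window: half-open (start, stop) index range per dimension
    """
    ranges = []
    for size, (start, stop) in zip(chunk_shape, window):
        if stop <= start:
            return []  # empty window, nothing to read
        first = start // size        # chunk holding the first requested index
        last = (stop - 1) // size    # chunk holding the last requested index
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


def chunk_grid_size(shape, chunk_shape):
    """Total number of chunks in the array (ceiling division per dimension)."""
    return prod(-(-n // c) for n, c in zip(shape, chunk_shape))


# Hypothetical data cube: 365 time steps of a 10980 x 10980 pixel raster,
# stored in chunks of 1 time step x 512 x 512 pixels (illustrative values).
shape = (365, 10980, 10980)
chunk_shape = (1, 512, 512)

# Extent of interest: 10 time steps and a 1000 x 1000 pixel window.
window = ((100, 110), (2000, 3000), (4000, 5000))

needed = chunks_for_window(chunk_shape, window)
total = chunk_grid_size(shape, chunk_shape)
print(f"{len(needed)} of {total} chunks need to be fetched")
```

Under these assumed numbers only 90 of 176660 chunks are touched, which is the effect lazy loading relies on: the metadata tells the reader where the chunks are, and only the intersecting ones are materialized.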

Analysis ready datasets (ARD) are defined to have all necessary transformations and aggregations applied and the required metadata included, making them directly consumable for domain-specific processing - from temporal analysis to statistical forecasts to AI and Machine Learning (ML) use cases.
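As one illustration of the kind of transformation and aggregation an ARD might already have applied, the following sketch builds a cloud-masked temporal median composite from a tiny stack of images. It uses only the Python standard library. The function name, the toy pixel values and the mask layout are assumptions for illustration, not a description of any specific ARD product.

```python
from statistics import median


def temporal_median(stack, cloud_masks, nodata=None):
    """Per-pixel median over time, ignoring observations flagged as cloudy.

    stack: list of 2D lists, one per time step
    cloud_masks: same layout as stack, True where the pixel is cloudy
    nodata: value used when a pixel has no clear observation at all
    """
    rows, cols = len(stack[0]), len(stack[0][0])
    composite = []
    for r in range(rows):
        row = []
        for c in range(cols):
            clear = [
                image[r][c]
                for image, mask in zip(stack, cloud_masks)
                if not mask[r][c]
            ]
            row.append(median(clear) if clear else nodata)
        composite.append(row)
    return composite


# Toy example: three time steps of a 2 x 2 image (scaled integer reflectance).
stack = [
    [[1000, 9000], [3000, 8000]],
    [[1200, 2000], [8500, 7500]],
    [[1100, 2200], [2800, 9500]],
]
cloud_masks = [
    [[False, True], [False, True]],
    [[False, False], [True, True]],
    [[False, False], [False, True]],
]

composite = temporal_median(stack, cloud_masks)
print(composite)
```

The bottom-right pixel is cloudy in every time step, so it ends up as `nodata`. A consumer of an analysis ready dataset would receive such a composite directly and would not have to repeat the masking step.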

If ARDs are stored in a cloud friendly format, the only missing piece is an elastic and scalable compute layer - a workspace environment with all the infrastructure resources to run processes in a cost-effective manner, enabling reusable and reproducible workflow execution, both interactively and in a headless fashion (for on demand, systematic and scheduled invocation).

We at EOX operate such cloud workspaces, providing different flavors of computational resources and storage options in various clouds, to run customer workloads close to the EO data archives they need, like Amazon Public Data Sets (PDS) or the Mundi Dias data offerings. Please reach out to us for details. Note: our AWS EU-Central-1 cloud workspace offering, the EDC EOxHub Workspace, is accessible in self-service mode!

But even if the data you need is not yet in a cloud-optimized format, or is still at a lower processing level (i.e. not analysis ready), you are not on your own with the data preparation steps.

One option is to use managed service offerings like EDC Sentinel Hub Batch processing or the EDC xcube Generator. Both take care of connecting to a data archive, transforming the loaded data based on a supplied processing script, and staging the result data into object storage in a cloud friendly and (now hopefully) analysis ready way, to be used for subsequent domain-specific processing as laid out before.

Another possibility is to leverage cloud workspaces like the above-mentioned EDC EOxHub Workspace for pre-processing as well. There are many great EO libraries for accessing and transforming EO data, many of them freely and publicly available, and a curated selection of them comes bundled with the individual cloud workspace offerings, ready for use.

We will host a science user testimonial at ESA EO Φ-Week demonstrating such a data preparation step based on a concrete example. So if you are interested not only in how Henrik Fisser could detect trucks on roads across Europe using free 10m imagery, but also in how he used the xcube-sh library (a Python package bundled within EDC) to shuffle 38 TB of imagery data with tens of millions of API calls from the EDC Sentinel Hub API to his cloud workspace before he applied the truck detection algorithm, you should definitely check out this Φ-Week side-event.