Location-Transparent Distributed Datacube Processing

Semantic-rich query languages enable automatic query splitting and asymmetric load balancing

It is a common first-semester Computer Science insight that programs written in general-purpose (“Turing-complete”) languages cannot, in general, be analyzed automatically by another program. Yet such analysis is necessary, among other things, for generating truly distributed execution plans. Restricted models, however, are amenable to such analysis. A prominent example is the SQL database language, which (i) has a focused data structure, tables, and (ii) has no iteration in its core.
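To illustrate why a restricted, iteration-free language is analyzable, here is a minimal sketch of a toy array expression tree in Python. The types `Cube` and `BinOp` and the function `referenced_cubes` are hypothetical names chosen for this example and are not part of rasdaman or any standard. Because the expression is a finite tree without loops, a simple traversal always terminates and can determine which datacubes a query touches, which is the kind of information a planner needs before distributing work.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Cube:
    """Leaf node: a reference to a named datacube (hypothetical model)."""
    name: str


@dataclass(frozen=True)
class BinOp:
    """Inner node: a cell-wise binary operation on two sub-expressions."""
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Cube, BinOp]


def referenced_cubes(expr: Expr) -> set[str]:
    """Collect the names of all datacubes an expression reads.

    The recursion follows the finite expression tree, so it always
    terminates; no loop construct exists in this toy language.
    """
    if isinstance(expr, Cube):
        return {expr.name}
    return referenced_cubes(expr.left) | referenced_cubes(expr.right)


# Toy query: A - (B * A)
query = BinOp("-", Cube("A"), BinOp("*", Cube("B"), Cube("A")))
print(sorted(referenced_cubes(query)))
```

The same style of traversal would not work for an arbitrary program in a Turing-complete language, where the set of data accessed may depend on computation that cannot be predicted statically.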

We present a similar high-level query language based on a multi-dimensional array model, also known as “datacubes”, and its implementation in the rasdaman Array DBMS, a database system centered around arrays (rather than tables). The query optimizer can generate both local and distributed plans and find efficient ways to split the work. The net effect for the user is that disparate, independent rasdaman deployments can be federated in a location-transparent manner: users do not need to know where the data accessed in a query reside. In the extreme case, cloud-based data centers can be federated with small edge devices like the Raspberry Pi, and the optimizer can generate plans which take into account the asymmetric capabilities of the nodes involved.
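As a rough illustration of asymmetric work splitting, the following Python sketch divides an index range along one datacube axis among nodes in proportion to a relative capacity value. This is only an assumed, simplified strategy for demonstration, not the rasdaman optimizer's actual algorithm; `Node`, `capacity`, and `split_extent` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    capacity: float  # relative processing capability (hypothetical unit)


def split_extent(lo: int, hi: int, nodes: list[Node]) -> dict[str, tuple[int, int]]:
    """Split the inclusive index range [lo, hi] into contiguous slices
    whose sizes are proportional to each node's capacity."""
    total = hi - lo + 1
    weight_sum = sum(n.capacity for n in nodes)
    plan: dict[str, tuple[int, int]] = {}
    start = lo
    cumulative = 0.0
    for i, node in enumerate(nodes):
        cumulative += node.capacity
        if i == len(nodes) - 1:
            end = hi  # last node takes whatever remains
        else:
            # Boundary from the cumulative share avoids rounding drift.
            end = lo + round(total * cumulative / weight_sum) - 1
        if end >= start:  # skip nodes whose share rounds to nothing
            plan[node.name] = (start, end)
            start = end + 1
    return plan


# A large cloud node federated with a small edge device.
nodes = [Node("cloud", 9.0), Node("pi", 1.0)]
print(split_extent(0, 99, nodes))
```

With these assumed capacities, the cloud node receives 90 of the 100 slices along the axis and the edge device the remaining 10. A real optimizer would also weigh factors such as data location and network transfer cost.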

This language has been adopted as Part 15 of the ISO SQL standard, SQL/MDA (Multi-Dimensional Arrays). A slightly modified version named Web Coverage Processing Service (WCPS), which additionally incorporates space-time semantics, has been standardized by the Open Geospatial Consortium and ISO. In rasdaman, WCPS queries are internally mapped to SQL/MDA queries, which are subsequently evaluated, including distributed processing.

We present concepts, architecture, and live demos on geo datacubes representing weather forecasts and satellite image timeseries.