Hashing SPARQL query results

A SPARQL-based approach to track RDF updates

The recent growth of RDF usage has come with a rising need to verify data obtained from SPARQL endpoints. It is now possible to deploy Semantic Web pipelines and to adapt them to a wide range of needs and use cases. In practice, these complex ETL pipelines, which rely on SPARQL endpoints to extract relevant information, often have to be relaunched from scratch periodically in order to refresh their data. This habit adds load on the network and is resource-intensive, and it is sometimes unnecessary when the data has not changed.
Here, we present a method to help data consumers (and pipeline designers) identify when data has been updated in a way that impacts a pipeline's result set. This method is based on standard SPARQL 1.1 features and relies on digitally signing parts of query result sets to inform data consumers about possible changes.

The SPARQL 1.1 standard provides a large set of built-in functions, ranging from string manipulation to date handling, which query designers can use to refine their result sets. In particular, the standard offers a set of five hash functions: MD5, SHA1, SHA256, SHA384 and SHA512.
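For instance, a minimal query (our own illustration, not taken from the standard) hashes a plain literal and returns its hex digest as a single binding:

    SELECT (SHA256("2022-01-01") AS ?h) WHERE { }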

On the same endpoint, and for the same dataset, the same query (without calls to non-deterministic functions such as RAND or NOW) is supposed to return the same results. Therefore, we think this lightweight SPARQL-based signing approach could be useful for ETL pipeline designers. Indeed, a common challenge for pipeline designers is knowing when a refresh, i.e. a re-run (often from scratch), is needed following a data update. Most of the time, there is no way to know a priori that datasets have been updated, so pipelines are often run needlessly when nothing has been modified. This unfortunately leads to time-consuming and sometimes costly processes in terms of both resources and network bandwidth, as the pipelines shuffle multiple intermediate results around.
To tackle this issue, a hash of the results can be computed by the endpoint itself and compared with a previously obtained hash the user has saved. In case of a mismatch, the query (and the rest of the pipeline) can be run again. Assuming Q is the considered SPARQL SELECT query, we propose the following steps to generate a query which computes the hash of the results of Q:

  1. Extract and sort the list of distinguished variables V (if a * is given, the considered variables are the ones involved in the WHERE {...});
  2. Wrap Q in a SELECT * ... query ordered by V (this is required because GROUP_CONCAT is not deterministic otherwise);
  3. Embed the obtained query in a SELECT query computing the hash of the grouped concatenation of the distinguished variables cast to strings (see the example after this list).
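For illustration, take a hypothetical Q = SELECT ?s ?label WHERE { ?s rdfs:label ?label } (our own example, not taken from a real pipeline). With SHA256, the three steps produce a query of the following shape:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT (SHA256(GROUP_CONCAT(CONCAT(STR(?label), STR(?s)))) AS ?hash)
    WHERE {
      SELECT * WHERE {
        { SELECT ?s ?label WHERE { ?s rdfs:label ?label } }
      }
      ORDER BY ?label ?s
    }

Since the aggregate appears without a GROUP BY, the whole (ordered) result set forms a single group, so the endpoint returns a single ?hash binding that the consumer can store and compare on the next run.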

We developed a JavaScript function able to take a SELECT query and transform it so that its execution returns the hash of the results. Technically, our solution uses sparql.js to properly parse SPARQL 1.1.
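The snippet below is a minimal sketch of that transformation, not our production code: it assumes the sparqljs npm package (its Parser and Generator classes), a query whose distinguished variables are listed explicitly (no *, no (expr AS ?x) aliases), and hashQuery is a name introduced here for illustration.

    // Minimal sketch, assuming the sparqljs package and a plain SELECT query
    // with explicitly listed variables. hashQuery is a hypothetical name.
    const { Parser, Generator } = require('sparqljs');

    function hashQuery(queryString, hashFn = 'SHA256') {
      const ast = new Parser().parse(queryString);
      if (ast.queryType !== 'SELECT') throw new Error('A SELECT query is expected');

      // Step 1: extract and sort the distinguished variables.
      const vars = ast.variables
        .filter(v => v.termType === 'Variable') // assumes no aliases/wildcard
        .map(v => v.value)
        .sort();

      // PREFIX declarations cannot appear inside a sub-query, so hoist them.
      const prefixes = Object.entries(ast.prefixes || {})
        .map(([p, iri]) => `PREFIX ${p}: <${iri}>`)
        .join('\n');
      ast.prefixes = {};
      const inner = new Generator().stringify(ast);

      // Steps 2 and 3: wrap Q in an ordered SELECT * sub-query, then hash the
      // grouped concatenation of the string-cast variables.
      const orderBy = vars.map(v => `?${v}`).join(' ');
      const row = vars.map(v => `STR(?${v})`).join(', ');
      const header = prefixes ? prefixes + '\n' : '';
      return header +
    `SELECT (${hashFn}(GROUP_CONCAT(CONCAT(${row}))) AS ?hash)
    WHERE {
      SELECT * WHERE { { ${inner} } }
      ORDER BY ${orderBy}
    }`;
    }

Sorting V fixes both the ORDER BY and the argument order of CONCAT, and casting each binding with STR() gives every row a well-defined string form, so two runs over an unchanged dataset concatenate to the same string and hash to the same digest.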
