The recent increase in RDF usage has come with a rising need to verify data obtained from SPARQL endpoints. It is now possible to deploy Semantic Web pipelines and to adapt them to a wide range of needs and use-cases. In practice, these complex ETL pipelines, which rely on SPARQL endpoints to extract relevant information, often have to be relaunched from scratch at regular intervals to refresh their data. Such a habit adds load on the network and consumes significant resources, while sometimes being unnecessary if the data remains untouched.
Here, we present a method to help data consumers (and pipeline designers) identify when data has been updated in a way that impacts the pipeline's result set. This method is based on standard SPARQL 1.1 features and relies on digitally signing parts of query result sets so that data consumers can be informed of possible changes.
The SPARQL 1.1 standard provides a large set of built-in functions, ranging from string manipulation to date handling. Query designers can use these to refine their result sets. In particular, the standard offers a set of five hash functions: MD5, SHA1, SHA256, SHA384 and SHA512.
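Each of these functions takes a string literal and returns its hexadecimal digest as a plain string literal. For instance, a minimal sketch of their use (the rdfs:label pattern is only illustrative):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s (SHA256(STR(?label)) AS ?labelHash)
    WHERE { ?s rdfs:label ?label }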
As we know, on the same endpoint, the same query (without calls to functions like RAND or NOW) is supposed to return the same results for the same dataset. Therefore, we think this lightweight SPARQL-based signing approach could be useful for ETL pipeline designers. Indeed, a common challenge for pipeline designers is to know when a refresh, i.e. a re-run (often from scratch), is needed following a data update. Most of the time, there is no way to know a priori that datasets have been updated and, as a result, pipelines are often run uselessly when nothing has been modified. This, unfortunately, leads to time-consuming and (sometimes) costly processes in terms of both resources and network bandwidth, as the multiple intermediate results involved in the pipelines are shuffled around.
To tackle this issue, a hash of the results could be computed by the endpoint itself and compared with a previously obtained hash the user would have saved. In case of a mismatch, the query (and the rest of the pipeline) could be run again. Assuming Q is the considered SPARQL select query, we propose the following steps to generate a query which computes the hash of the results of Q:
1. Extract the list V of the distinguished variables of Q (if a * is given, the considered variables are the ones involved in the where {...});
2. Embed Q in a select * ... query ordered by V (this is required, as group_concat isn't deterministic otherwise);
3. Wrap the whole in a select query computing the hash of the grouped concatenation of the cast (to string) distinguished variables, as illustrated below.
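For instance, a minimal sketch of the transformation, assuming SHA256 as the hash function and an illustrative query Q (the rdfs:label pattern is an assumption, not from our evaluation):

    # Q, the original query:
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?label WHERE { ?s rdfs:label ?label }

    # The generated hashing query:
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT (SHA256(GROUP_CONCAT(CONCAT(STR(?s), STR(?label)); separator="")) AS ?hash)
    WHERE {
      { SELECT * WHERE { { SELECT ?s ?label WHERE { ?s rdfs:label ?label } } }
        ORDER BY ?s ?label }
    }

A data consumer then only has to store the returned ?hash and trigger a re-run of the pipeline when a later execution of the generated query yields a different value.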
We developed a JavaScript function able to take a SELECT query and transform it so that its execution returns the hash of the results. Technically, our solution uses sparql.js to properly parse SPARQL 1.1.