~Guessing potential RDF vocabularies from natural language texts using LDA~

The online availability of text corpora now allows data practitioners to build complex knowledge by combining various sources. A common challenge lies in modeling intermediate knowledge structures that cover, at once, the various topics present in the texts. In practice, practitioners often do this by creating vocabularies. To help these domain experts, we design LOV-ES: a solution that supports this creative process by guiding them in selecting and combining existing vocabularies available online. Technically, our solution relies on Latent Dirichlet Allocation (LDA) to detect topics and on the Linked Open Vocabularies (LOV) to then propose candidate vocabularies.

The proposed architecture performs two main tasks: (1) applying LDA to identify the underlying topics of a given text block, and (2) ranking candidate vocabularies by their relevance to those topics. In detail, we rely on the original version of LDA, which has two hyperparameters: α (controlling the prior distribution over topic weights in each document) and β (setting the prior distribution over word weights in each topic), set to 0.1 and 0.01 respectively. Each topic word bag produced by LDA is then used to build a SPARQL query that extracts candidate vocabularies. Typically, the words are passed to the SPARQL endpoint through the VALUES clause, as presented schematically below:

     SELECT ?word ?voc WHERE {
     VALUES ?word { "word1" "word2" "word3" } # Word list from an LDA bag.
     ?voc a lod:Vocabulary . ?voc dcterms:description ?description . [...]
     FILTER ( CONTAINS ( STR(?description),?word ) ) }

The rest of the query then returns candidate vocabularies ?voc if occurrences of ?word are present within the vocabulary description, under the assumption that specific words are used to describe more generic vocabularies. Once candidate vocabularies are obtained, we sort them according to a metric described in our article and display them in decreasing order.
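
As an illustration, the query construction and the final sort could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the LOV-ES implementation: the function names (sparql_literal, build_query, rank_by_matched_words) are hypothetical, the prefix declarations and the elided patterns of the schematic query are left out as in the text above, and the ranking uses a simple stand-in score (the number of distinct topic words matched by each vocabulary) rather than the metric described in the article.

```python
from __future__ import annotations

from collections import defaultdict


def sparql_literal(word: str) -> str:
    """Quote a word as a SPARQL string literal, escaping backslashes and quotes."""
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_query(topic_words: list[str]) -> str:
    """Fill the VALUES clause of the schematic query with one LDA word bag.

    Prefix declarations are omitted, as in the schematic query, so the
    returned string must be completed before it is sent to an endpoint.
    """
    values = " ".join(sparql_literal(w) for w in topic_words)
    return (
        "SELECT ?word ?voc WHERE {\n"
        f"  VALUES ?word {{ {values} }}\n"
        "  ?voc a lod:Vocabulary .\n"
        "  ?voc dcterms:description ?description .\n"
        "  # [...] remaining patterns are elided in the schematic query\n"
        "  FILTER ( CONTAINS ( STR(?description), ?word ) )\n"
        "}"
    )


def rank_by_matched_words(rows: list[tuple[str, str]]) -> list[str]:
    """Sort vocabularies by a stand-in score, in decreasing order.

    `rows` holds (word, voc) pairs as returned by the query. The score is
    the number of distinct topic words matched by each vocabulary; this is
    an assumption for illustration, not the metric from the article.
    Ties are broken alphabetically to keep the output deterministic.
    """
    matched: dict[str, set[str]] = defaultdict(set)
    for word, voc in rows:
        matched[voc].add(word)
    return sorted(matched, key=lambda voc: (-len(matched[voc]), voc))


if __name__ == "__main__":
    # Hypothetical word bag standing in for one LDA topic.
    print(build_query(["music", "artist", "album"]))
```

Escaping the words before they enter the VALUES clause keeps a stray quote in a topic word from breaking the query, and grouping the (word, voc) rows per vocabulary gives a single place where the stand-in score could be swapped for the actual metric.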
