Failed to parse the Service result as JSON #1427

Open
tarcisiotmf opened this issue Jul 31, 2024 · 11 comments

Comments

@tarcisiotmf

When I execute the query below with QLever, the error below is raised. The same query runs as expected in GraphDB. You can reproduce the error with the following links:

Executing query with qlever

Executing query with graphdb, select emi-dbgi repository

The dataset used in our test is available here.

"exception": "Failed to parse the Service result as JSON. First 100 bytes: SPARQL-QUERY: queryStr=...

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX emi: <https://purl.org/emi#>
PREFIX sosa: <http://www.w3.org/ns/sosa/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?extract ?organe ?species_name ?genus_name ?family_name ?count_of_selected_class
WHERE
    {
        ?material sosa:hasSample ?extract .
        ?material sosa:isSampleOf ?organe .
        ?organe emi:inTaxon ?wd_sp .
        OPTIONAL
        {
            SERVICE <https://query.wikidata.org/sparql> {
                ?wd_sp wdt:P225 ?species_name .
                ?family wdt:P31 wd:Q16521 ;
                    wdt:P105 wd:Q35409 ;
                    wdt:P225 ?family_name ;
                    ^wdt:P171* ?wd_sp .
                ?genus wdt:P31 wd:Q16521 ;
                    wdt:P105 wd:Q34740 ;
                    wdt:P225 ?genus_name ;
                    ^wdt:P171* ?wd_sp
            }
        }
        {
            SELECT ?extract (COUNT(DISTINCT ?feature) AS ?count_of_selected_class)
            WHERE
            {
                ?extract rdf:type emi:ExtractSample .
                ?extract sosa:isFeatureOfInterestOf ?lcms .
                ?lcms rdf:type emi:LCMSAnalysis .
                ?lcms emi:hasLCMSFeatureSet ?feature_list .
                ?feature_list emi:hasLCMSFeature ?feature .
                ?feature emi:hasAnnotation ?canopus .
                ?canopus rdf:type emi:ChemicalTaxonAnnotation .
                ?canopus emi:hasClass ?np_class .
                ?np_class rdfs:label "Aspidosperma type" .
                ?canopus emi:hasClassProbability ?class_prob .
                FILTER((?class_prob > 0.5)) .
            } GROUP BY ?extract ORDER BY DESC(?count_of_selected_class)
        }
    }
@tuukka

tuukka commented Jul 31, 2024

"the Service result" probably means what query.wikidata.org returns to QLever. The result clearly is not valid JSON as it starts with SPARQL-QUERY (and not with a JSON object). The query is quite slow, so it could be a timeout in query.wikidata.org.
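
To make the failure mode concrete, here is a minimal Python sketch of the kind of check that would produce such a message. It is not QLever's actual code; the function name `parse_service_result` and the message layout are assumptions modelled on the error text quoted in this issue.

```python
import json


def parse_service_result(body: bytes) -> dict:
    """Parse a SPARQL JSON result body, or raise with a short preview of it."""
    try:
        return json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as err:
        # Show only the first 100 bytes so the message stays readable.
        preview = body[:100].decode("utf-8", errors="replace")
        raise ValueError(
            "Failed to parse the Service result as JSON. "
            f"First 100 bytes: {preview}"
        ) from err
```

A body that starts with `SPARQL-QUERY: queryStr=` is an error report from the remote endpoint rather than a result set, so it fails at the first character and the preview shows what the endpoint actually sent.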

You could use QLever's Wikidata endpoint instead, if you change the SERVICE to the following: SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata>

@tuukka

tuukka commented Jul 31, 2024

If I switch from WDQS to QLever Wikidata, the error is different: Blank nodes in the result of a SERVICE are currently not supported. For now, consider filtering them out using the ISBLANK function or converting them via the STR function.

Finally, after I added the ISBLANK filters, it timed out after 120 seconds: https://qlever.cs.uni-freiburg.de/wikidata/1gsnSL
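
As a rough illustration of the ISBLANK workaround named in the error message, the Python helper below generates one FILTER line per variable, to be placed inside the SERVICE block. The helper name `blank_node_filters` and the choice of variables are assumptions for illustration; the exact filters used in the linked query are not shown in this thread.

```python
def blank_node_filters(variables):
    """Return one FILTER line per variable that rejects blank-node bindings."""
    return "\n".join(f"FILTER(!ISBLANK(?{name}))" for name in variables)


# Hypothetical choice: filter the three name variables returned by the SERVICE.
print(blank_node_filters(["species_name", "genus_name", "family_name"]))
```

Note that filtering removes the affected rows from the SERVICE result; converting with STR, the other option the message mentions, would keep them as strings instead.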

@tarcisiotmf
Author

tarcisiotmf commented Jul 31, 2024

Thanks for the quick reply and support! I am really impressed with the performance of the QLever demos.

I have changed the service request to your SPARQL endpoint, and I got the following error after more than 1 minute of execution. However, the subquery in the service query has only projections without blank nodes as values.

{
    "exception": "Blank nodes in the result of a SERVICE are currently not supported. For now, consider filtering them out using the ISBLANK function or converting them via the STR function.",
    "query": "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nPREFIX wdt: <http://www.wikidata.org/prop/direct/>\nPREFIX wd: <http://www.wikidata.org/entity/>\nPREFIX emi: <https://purl.org/emi#>\nPREFIX sosa: <http://www.w3.org/ns/sosa/>\nPREFIX prov: <http://www.w3.org/ns/prov#>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\nSELECT ?extract ?organe ?species_name ?genus_name ?family_name ?count_of_selected_class\nWHERE\n    {  \n    ?material sosa:hasSample ?extract .\n        ?material sosa:isSampleOf ?organe .\n        ?organe emi:inTaxon ?wd_sp .\n        OPTIONAL\n        {\n            SERVICE  <https://qlever.cs.uni-freiburg.de/api/wikidata> {\n            ?wd_sp wdt:P225 ?species_name .\n            ?family wdt:P31 wd:Q16521 ;\n                wdt:P105 wd:Q35409 ;\n                wdt:P225 ?family_name ;\n                ^wdt:P171* ?wd_sp .\n            ?genus wdt:P31 wd:Q16521 ;\n                wdt:P105 wd:Q34740 ;\n                wdt:P225 ?genus_name ;\n                ^wdt:P171* ?wd_sp \n            }\n        }\n        {\n            SELECT ?extract (COUNT(DISTINCT ?feature) AS ?count_of_selected_class)\n            WHERE\n            {   \n                ?extract rdf:type emi:ExtractSample .\n                ?extract sosa:isFeatureOfInterestOf ?lcms .\n                ?lcms rdf:type emi:LCMSAnalysis .\n                ?lcms emi:hasLCMSFeatureSet ?feature_list .\n                ?feature_list emi:hasLCMSFeature ?feature .\n                ?feature emi:hasAnnotation ?canopus .\n            \t?canopus rdf:type emi:ChemicalTaxonAnnotation . \n                ?canopus emi:hasClass ?np_class .\n            \t?np_class rdfs:label \"Aspidosperma type\" .\n                ?canopus emi:hasClassProbability ?class_prob .\n                FILTER((?class_prob > 0.5)) .\n            } GROUP BY ?extract ORDER BY DESC(?count_of_selected_class)\n        }\n    }  \n",
    "resultsize": 0,

A second issue with relying on your Wikidata endpoint is that it may not have access to the latest Wikidata data. Moreover, the non-JSON response from Wikidata is due to an error in Wikidata (out of memory). After a quick test, if we try to run the Wikidata part of the query without considering the results from the outer query (QLever BGP processing), we get either a timeout or an out-of-memory error because QLever is not considering the results of the outer query to filter the results from the inner query (service subquery - query plan related issue?).

SPARQL-QUERY: queryStr=select ?wd_sp ?species_name ?genus_name ?family_name {
            ?wd_sp wdt:P225 ?species_name .
            ?family wdt:P31 wd:Q16521 ;
                wdt:P105 wd:Q35409 ;
                wdt:P225 ?family_name ;
                ^wdt:P171* ?wd_sp .
            ?genus wdt:P31 wd:Q16521 ;
                wdt:P105 wd:Q34740 ;
                wdt:P225 ?genus_name ;
                ^wdt:P171* ?wd_sp  } #limit 100
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.QueryEvaluationException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.bigdata.rwstore.sector.MemoryManagerOutOfMemory
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)

@tuukka

tuukka commented Jul 31, 2024

However, the subquery in the service query has only projections without blank nodes as values.

Here's an example of a taxon whose taxon name is an "unknown value" in Wikidata, and this is represented as a blank node in RDF: https://www.wikidata.org/wiki/Q21362983

If the top-level query result shouldn't have any such results, it may be a signal that the query plan is indeed non-optimal regarding this subquery.

A second issue with relying on your Wikidata endpoint is that it may not have access to the latest Wikidata data.

I believe there's work in progress to implement rolling updates in QLever's Wikidata endpoint. [I'm not a member of the QLever developer team but a Wikidata contributor.]

Moreover, the non-JSON response from Wikidata is due to an error in Wikidata (out of memory).

Right - WDQS is known to have scaling issues, but this could also be because of QLever making a non-optimal query plan?

we get either a timeout or an out-of-memory error because QLever is not considering the results of the outer query to filter the results from the inner query (service subquery - query plan related issue?).

Just to clarify, are you saying that QLever is not sending the values of ?wd_sp (count 747) from the outer query to WDQS for the subquery? (It would be nice to have a way to see the exact subquery that is being sent to WDQS.)

If you have a QLever UI for your endpoint, you can click the button "Analysis" to view the query plan. (Otherwise, it's included in the JSON response which may be difficult to read.)

I don't know if the heuristics can be tweaked, but by reorganising the subqueries, I'm indeed getting the query to complete (with one result - is that the correct result?): https://qlever.cs.uni-freiburg.de/wikidata/uAxDnP

@tarcisiotmf
Author

tarcisiotmf commented Jul 31, 2024

Thanks, please see below my replies:

Right - WDQS is known to have scaling issues, but this could also be because of QLever making a non-optimal query plan?

Yes, I think so.

Just to clarify, are you saying that QLever is not sending the values of ?wd_sp (count 747) from the outer query to WDQS for the subquery?

It looks like it, since we are getting an out-of-memory error from Wikidata. The query works when run with a LIMIT, or when ?wd_sp is explicitly assigned in the subquery.

If you have a QLever UI for your endpoint, you can click the button "Analysis" to view the query plan. (Otherwise, it's included in the JSON response which may be difficult to read.)

Sorry, I don't have it but I have provided the query and dataset to replicate the issue.

I don't know if the heuristics can be tweaked, but by reorganising the subqueries, I'm indeed getting the query to complete (with one result - is that the correct result?): https://qlever.cs.uni-freiburg.de/wikidata/uAxDnP

Yes, thanks. You can verify it with the link I provided when I opened this issue, by executing the same query in GraphDB (which is also faster for this query) at the SPARQL endpoint https://biosoda.unil.ch/graphdb/repositories/emi-dbgi:

querying with graphdb

@tuukka

tuukka commented Aug 1, 2024

Finally, after I added the ISBLANK filters, it timed out after 120 seconds: https://qlever.cs.uni-freiburg.de/wikidata/1gsnSL

@hannahbast Could you have a brief look at this issue? Basically, are the statements within a SERVICE clause always sent to the remote endpoint as is and the join done afterwards, or is it somehow possible to get QLever to evaluate local triple patterns first and send the resulting bindings to the remote endpoint (by adding VALUES statements, I imagine)?
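
The idea of sending local bindings to the remote endpoint can be sketched in Python as follows. This only illustrates the general technique and says nothing about how QLever implements it; `constrain_service`, its parameters, and the `max_rows` cap are hypothetical names, and the IRIs in the usage lines are placeholders.

```python
def constrain_service(endpoint, patterns, variable, iris, max_rows=100):
    """Build a SERVICE block, prepending a VALUES clause if the bindings are few enough."""
    lines = [f"SERVICE <{endpoint}> {{"]
    # Ship the locally computed bindings only when the list is non-empty
    # and small enough to be worth sending along with the query.
    if iris and len(iris) <= max_rows:
        values = " ".join(f"<{iri}>" for iri in iris)
        lines.append(f"  VALUES ?{variable} {{ {values} }}")
    lines.append(f"  {patterns}")
    lines.append("}")
    return "\n".join(lines)


# Placeholder bindings standing in for the ?wd_sp values of the outer query.
print(constrain_service(
    "https://query.wikidata.org/sparql",
    "?wd_sp wdt:P225 ?species_name .",
    "wd_sp",
    ["http://example.org/taxon1", "http://example.org/taxon2"],
))
```

With the VALUES clause in place, the remote endpoint only has to evaluate the patterns for the listed taxa instead of for every taxon it knows.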

@joka921
Member

joka921 commented Aug 1, 2024

Hi @tuukka and @tarcisiotmf ,

Thanks for your interest in QLever. I found the time to look at your issue, and I can say the following:

  1. The limitation concerning blank nodes in SERVICE queries is unfortunate, but it will be fixed eventually (I have to assign someone or myself to it :)), and it can typically be worked around.
  2. Since rather recently, QLever in principle supports constraining SERVICE queries by enriching them with VALUES clauses from the enclosing query. This mechanism currently has two limitations:
    2.1. It only sends the VALUES clause if it has at most 100 entries. This default is too low; it can be changed by issuing a GET request to
    <urlOfTheSparqlServer>/?service-max-value-rows=<newIntegerValue>&access-token=<yourAccessToken> .
    However, this is not a problem for your concrete query, as the outer context boils down to a single result.
    2.2. (Very relevant for you): QLever does NOT know how to constrain a SERVICE if the SERVICE is inside an OPTIONAL clause
    and the constraining triples are outside the OPTIONAL. In general, QLever is currently not good at optimizing OPTIONALs: we always fully evaluate the content of the OPTIONAL and then join it with everything that stands before it in the query.
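
For reference, the GET request from point 2.1 can be assembled in Python as in the sketch below. The parameter names `service-max-value-rows` and `access-token` are taken from the comment above; the helper name, the server URL, and the token in the usage line are placeholders.

```python
from urllib.parse import urlencode


def max_value_rows_url(server_url, new_value, access_token):
    """Build the GET URL that changes service-max-value-rows on a QLever server."""
    query = urlencode({
        "service-max-value-rows": int(new_value),
        "access-token": access_token,
    })
    return f"{server_url.rstrip('/')}/?{query}"


# Placeholder server address and token; substitute your own.
print(max_value_rows_url("http://localhost:7001", 1000, "my-token"))
```

Passing the resulting URL to `urllib.request.urlopen` would issue the request.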

Your current workarounds thus are:

  1. Drop the OPTIONAL around the SERVICE (then your query works in a reasonable time).
  2. Duplicate the constraining context INSIDE the OPTIONAL (ugly, but it preserves the exact semantics).
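
The two workarounds can be sketched as follows, with the SPARQL fragments held in Python strings. Which triples count as the "constraining context" is an assumption here (the three outer patterns that bind ?wd_sp), the SERVICE body is abbreviated, and the helper names are hypothetical. Prefixes, the SELECT clause, and the counting subquery stay as in the original query.

```python
import textwrap

# Outer patterns that bind ?wd_sp (assumed to be the constraining context).
CONTEXT = """\
?material sosa:hasSample ?extract .
?material sosa:isSampleOf ?organe .
?organe emi:inTaxon ?wd_sp ."""

# Abbreviated SERVICE block; the full version carries the family and genus patterns.
SERVICE = """\
SERVICE <https://qlever.cs.uni-freiburg.de/api/wikidata> {
  ?wd_sp wdt:P225 ?species_name .
  # remaining ?family / ?genus patterns as in the original query
}"""


def without_optional():
    """Workaround 1: join the SERVICE directly with the outer patterns."""
    return f"{CONTEXT}\n{SERVICE}"


def with_duplicated_context():
    """Workaround 2: keep the OPTIONAL and repeat the context inside it."""
    inner = textwrap.indent(f"{CONTEXT}\n{SERVICE}", "  ")
    return f"{CONTEXT}\nOPTIONAL {{\n{inner}\n}}"


print(with_duplicated_context())
```

Workaround 1 changes the semantics: rows without a match in Wikidata are dropped rather than returned with unbound names. Workaround 2 keeps those rows, at the cost of evaluating the context twice.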

In general: Nice that you have set up a local SPARQL endpoint for your data. I would highly recommend also setting up the QLever UI, as its analysis capabilities are really useful, especially when sharing the results.

@tuukka

tuukka commented Aug 1, 2024

  1. Drop the OPTIONAL around the SERVICE (then your query works in a reasonable time).

Nice! With this approach, I'm seeing times a bit below 3s when the cache is clean.

With the better query plan, you can also go back to federating to query.wikidata.org if you wish: https://qlever.cs.uni-freiburg.de/wikidata/BkGzpj

@tuukka

tuukka commented Aug 1, 2024

2. In general, QLever is currently not good at optimizing OPTIONALs: we always fully evaluate the content of the OPTIONAL and then join it with everything that stands before it in the query.

By the way, I think this is something that a lot of queries currently suffer from and it seems it's not easy to work around.

@joka921 Could a solution to this be prioritized? For example, we add optional labels and other supplementary information as an afterthought with the intuition that for a reasonable number of results, it shouldn't be a lot of additional computation. I see #314 is related but this one should be much easier - more or less the same as sending VALUES to a SERVICE clause? Should we open a separate issue?

@tarcisiotmf
Author

@joka921 Thank you very much for the clarification about the current QLever limitations. Thanks, @tuukka, for your interest in this issue; it has been really helpful!

Please do as you think most appropriate (closing this issue and opening new issues). In my opinion, three different issues have resulted from this one:

  • the blank nodes limitation in the SERVICE clause
  • the VALUES clause limited to 100 entries for SERVICE calls: is there a parameter to include in the qlever file? It might be possible already, but I think it is not clearly documented; maybe I missed it. Moreover, what is the reason for the low default limit? It might be addressed simply by not having a default limit, or by having the highest possible one.
  • constraining a SERVICE if the SERVICE is inside an OPTIONAL clause

@joka921
Member

joka921 commented Oct 18, 2024

Hi,
There is an update to your various issues:
QLever has been supporting the SERVICE/OPTIONAL optimization that is required for your query for a while now.
And as of two minutes ago, QLever also supports blank nodes in the result of SERVICE requests. Note that this requires rebuilding your index.
Please check whether your original query works with the current master (the Docker image will need some hours until it is updated; the latest relevant PR that was merged was #1504).
