JDBC bridge + bigquery #154

andaag · 2022-07-05T10:11:30Z

So, I'd love for some official documentation on how to combine this with bigquery's jdbc bridge.

I just tested this... and surprisingly it works. Essentially just dumped all the jar files from the bigquery connector (issue pending to publish it in maven... currently a zip download) into the drivers directory, and set up a source for it.

{
  "$schema": "../../../../../docker/config/datasource.jschema",
  "bigquery": {
    "driverClassName": "com.simba.googlebigquery.jdbc.Driver",
    "jdbcUrl": "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=nansen-endpoints-dev;OAuthType=0;OAuthPvtKeyPath=/app/bigquery/sa.json;OAuthServiceAcctEmail=SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com",
    "initializationFailTimeout": 0,
    "minimumIdle": 0,
    "maximumPoolSize": 10
  }
}

This for us might be an interesting option. We're looking into some data sync jobs between bigquery and clickhouse, and it would allow us to use dbt materialized views for this for example. Probably wouldn't be optimal performance, but in terms of ease of use for our developers it would be a pretty interesting option.

How .. insane is this? It feels ... wrong. At the same time it did work fairly quickly. Is it worth spending the time to tuning timeouts and such to be able to use this to materialize bigquery data in clickhouse? I'd love to avoid our teams needing spark etc for this.

zhicwu · 2022-07-05T13:05:02Z

You're not crazy :) As long as the datasource has fully functional JDBC driver, you should be able to access that from ClickHouse, via JDBC bridge.

Probably wouldn't be optimal performance

The overhead of JDBC bridge is around 10% - 20% according to my test earlier, and it does not count the initial query for type inferring. However, I found it's acceptable for processing millions of rows or even near realtime monitoring(grafana + distributed query against various databases). In the case of cross-region data sync, it's actually faster than direct connect for some databases because of lz4 compression.

Is it worth spending the time to tuning timeouts and such to be able to use this to materialize bigquery data in clickhouse?

Unfortunately I'd suggest you limiting the usage for below reasons:

JDBC bridge has issues in both design and implementation
- there's no access control so you cannot share access with others - it may work for DataOps but definitely not for users
- too many requests(ping + type inferring + native query) for one single query
- lack of execution context - Pass execution context to table functions for accessing external datasource ClickHouse#33803
- lack of support for server-side configuration - Maintain list of trusted servers for access control ClickHouse#33805 (and no function/table for discovering JDBC bridge instances)
- implementation issue like Repeated queries sent to JDBC brige ClickHouse#36385
Timeout can be tricky
- you have to consider many timeout settings: JDBC drivers, ClickHouse, and JDBC bridge
- if you increase timeout for long running queries, small queries may hang for too long when there's network issue
- and remember ClickHouse server can only connect to one JDBC bridge server(and you don't have much control over DNS-based load balancing)...
Replace JDBC bridge by clickhouse-data-service(a new module in clickhouse-jdbc) in the future?
It's just my personal opinion, but maybe it's better to implement a new data service using existing url table function along with enhanced JDBC driver for more features(like access control, query rewrite, encryption, data masking, zero configuration etc.) and better performance(less round trips). It may not be fully backward compatible.

andaag · 2022-07-05T14:49:25Z

Thanks for the detailed writeup!

Can you link me to some more information on clickhouse-data-service ? Sounds very worth checking out.

zhicwu · 2022-07-05T22:32:39Z

Can you link me to some more information on clickhouse-data-service ?

Sorry it does not exist and I only mentioned it in ClickHouse/clickhouse-java#784 :p An alternative and more generic implementation as far as I know of, is trinodb/trino#1839.

andaag · 2022-07-06T09:21:17Z

Alright, thanks for your help!

I'll leave it up to you whether or not to leave this ticket as documentation or to close it.

zhicwu added the question Further information is requested label Jul 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JDBC bridge + bigquery #154

JDBC bridge + bigquery #154

andaag commented Jul 5, 2022

zhicwu commented Jul 5, 2022

andaag commented Jul 5, 2022

zhicwu commented Jul 5, 2022

andaag commented Jul 6, 2022

JDBC bridge + bigquery #154

JDBC bridge + bigquery #154

Comments

andaag commented Jul 5, 2022

zhicwu commented Jul 5, 2022

andaag commented Jul 5, 2022

zhicwu commented Jul 5, 2022

andaag commented Jul 6, 2022