Data from web traffic collection makes Datasette instance a lot slower #3
With the new data from our first traffic collection on the web (tweaselORG/data.tweasel.org#3), the request database has gotten quite huge and I have seen debug-adapter runs take a long time and even fail due to timeouts. Luckily, the timeouts at least can be solved quite easily. Turns out, we were already trying to load all matching rows in a single request. Now, we instead load it in chunks of 100 rows. Setting the _size parameter was sufficient for that—we had already implemented handling of the next_url in responses and row limits. With this change, I haven't seen any timeouts anymore so far and while still slower, it isn't too bad anymore. I still want to work on making data.tweasel.org as a whole faster again, but this is a good change in any case.
Solving the timeouts in TrackHAR, at least, was easy enough by making use of chunking: tweaselORG/TrackHAR#90
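To illustrate the chunking approach, here is a minimal sketch of paginated loading against Datasette's JSON API. The function name, URL, and injectable `fetch` parameter are my own illustration, not TrackHAR's actual implementation; `_size` and `next_url` are real Datasette API features.

```python
import json
from urllib.request import urlopen


def fetch_all_rows(start_url, fetch=None):
    """Load all matching rows in chunks by following Datasette's `next_url`.

    `fetch` maps a URL to a parsed JSON page; it is injectable for testing.
    """
    if fetch is None:
        fetch = lambda url: json.load(urlopen(url))
    rows = []
    url = start_url
    while url:
        page = fetch(url)
        rows.extend(page["rows"])
        # On the last page, `next_url` is null (or absent), ending the loop.
        url = page.get("next_url")
    return rows


# Hypothetical usage; the chunk size is set via the `_size` parameter:
# fetch_all_rows("https://data.tweasel.org/data/requests.json?_size=100")
```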
I was able to make one more "optimization": We previously kept the "Only include requests that are made to the same endpointUrl by apps from at least two different vendors" rule as-is for websites, but used the hostname here. That was perhaps a bit generous; this way, it was enough for an […].

With this, the database has shrunk to 7.9 GB and we now have 1,912,743 requests, with 1,592,854 of those being from websites.
This has helped a little but the performance is still not great. And quite a few more complex queries, including ones on the homepage, are still timing out. I fear that there isn't much more that we can do in terms of optimizations. I really see only two options:
What do you think, @zner0L?
Another random thought I just had, inspired by accidentally having two different databases in my local instance: Datasette can handle multiple databases. We could have separate databases for web and apps. But upon thinking about this a little more, it probably isn't a good idea:
I tried running the server in immutable mode as suggested by the documentation, and it did improve the performance a bit, even though queries still take quite long. I think part of this is also that some queries are just inefficient. We can try to squeeze more performance out of this instance with better queries and optimized data storage. However, SQLite is just not built for this amount of data. Right now, we are using a pretty inefficient software stack to handle too big of a dataset. I really don't like solving these kinds of problems with better hardware…
To run in immutable mode while using configuration directory mode, Datasette needs the database mentioned in […]. This improves the performance a bit and allows for the database to be downloaded. However, it introduces the problem of where to generate and store the […].
Oh, interesting. I also ran into the issue of not being able to combine configuration directory mode and immutable mode when I initially set up Datasette, but didn't find the trick of using […].

That being said, from my understanding after reading the docs, […].

Since our deployment method for the database is currently "copy and run this […]".
Documenting an investigation that happened a while ago: @mal-tee suggested that adding indexes for more than just the facet columns might help. Starting from the assumption that the only potential negative impact of just indexing everything should be on write performance, I learned that the subject is quite complex indeed and that there are whole books about it. For example, I'm not sure whether I knew before that there are multi-column indexes. Anyway, I also learned that SQLite has an […].

I tried that for one of our most used queries:

```sql
select count(distinct initiator) initiatorCount, count(*) requestCount, endpointUrl
from requests
where endpointUrl is not null and platform='web'
group by endpointUrl
order by initiatorCount desc
limit 101;
```

It suggested the following index:

```sql
CREATE INDEX requests_idx_b0c35c3d ON requests(platform, endpointUrl);
```

However, unless I'm severely misunderstanding something here, this actually massively decreased the query performance. Whereas previously, the query took 7.766s on my machine, with the index, it takes 19.515s. Just for the heck of it, I also tried creating single-column indexes for all columns appearing in the query, but that had no significant effect (7.746s).

This all seems very counter-intuitive to me and I feel like I'm missing something. But given that adding indexes can apparently negatively impact read performance after all, and that our request database is specifically meant for exploration, where we cannot know in advance which queries users will run, I fear that indexes may not be the best way forward.
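For reproducibility, SQLite's `EXPLAIN QUERY PLAN` shows whether a query actually uses an index. Here is a self-contained sketch against a toy copy of the schema (empty toy table, not our real data, so it says nothing about timings; only the plan change is shown):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table requests (initiator text, platform text, endpointUrl text)")

query = """
    select count(distinct initiator) initiatorCount, count(*) requestCount, endpointUrl
    from requests
    where endpointUrl is not null and platform='web'
    group by endpointUrl order by initiatorCount desc limit 101
"""


def plan(con, query):
    # The last column of each plan row is a human-readable step description.
    return [row[-1] for row in con.execute("explain query plan " + query)]


before = plan(con, query)  # without an index: a full table scan
con.execute("CREATE INDEX requests_idx_b0c35c3d ON requests(platform, endpointUrl)")
after = plan(con, query)   # the plan now refers to the index

print(before)
print(after)
```

Note that a plan using the index is no guarantee of it being faster, as the measurements above show; the planner's estimates can simply be wrong for a given data distribution.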
@mal-tee and especially @zner0L also kept saying that this is just an inherent limitation of SQLite and that we would need to use a "proper" DBMS like Postgres. I find that hard to believe and feel like 2.6M rows isn't thaaat much. In any case, it seems sensible to verify it (especially since Postgres support for Datasette isn't completely out of the picture).

The first hurdle is finding a way to import the data from SQLite into a Postgres database. I struggled with various tools. In the end, I managed to do it using DataGrip, by loading both databases into the same workspace and using the "Import/Export" -> "Copy Tables to" feature.

(Screencast: Screencast.from.2024-10-22.10-52-46.mp4)

That took 2min 15s, which seems very reasonable. To make the comparison fair, I manually created the same indexes as we have in SQLite (data.tweasel.org/scripts/make-database.ts, lines 104 to 105 in 3217f01).
For reference, here are the versions of both systems I'm using for the test. SQLite:
Postgres:
To create the Postgres database, I used Docker.

```yaml
# docker-compose.yml
version: '3.7'

services:
  db:
    image: postgres:17
    volumes:
      - data:/var/lib/postgresql/data
    env_file: ./.env
    ports:
      - '${HOST_PORT}:5432'

volumes:
  data:
```

```sh
# .env
POSTGRES_DB=tweasel-postgres-test
POSTGRES_USER=tweasel-postgres-test
POSTGRES_PASSWORD=pw
POSTGRES_HOST=localhost
HOST_PORT=5437
```
Now for the interesting part. I ran the same query in both DBMSs:

```sql
select count(distinct initiator) appCount, count(1) requestCount, endpointUrl from requests
where endpointUrl is not null and (platform='web')
group by endpointUrl order by appCount desc limit 101;
```

In SQLite, this took 21s. Whereas in Postgres, it took less than 7s.

While this is far from a scientific test, a 3x speed difference is certainly not nothing. Looks like I was wrong!
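As a side note, the numbers above come from single ad-hoc runs. For slightly fairer comparisons, one could use a tiny harness that takes the best of a few runs; here is a sketch with toy data and names of my own (the same approach works against Postgres with a different driver):

```python
import sqlite3
import time


def time_query(con, query, runs=3):
    """Run `query` a few times; return (best wall-clock seconds, last result rows)."""
    best = float("inf")
    rows = None
    for _ in range(runs):
        start = time.perf_counter()
        rows = con.execute(query).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows


# Toy table: 1000 requests, 10 endpoints, 50 initiators.
con = sqlite3.connect(":memory:")
con.execute("create table requests (initiator text, platform text, endpointUrl text)")
con.executemany(
    "insert into requests values (?, 'web', ?)",
    [(f"site{i % 50}", f"https://endpoint{i % 10}.example/") for i in range(1000)],
)

seconds, rows = time_query(
    con,
    "select count(distinct initiator) appCount, count(1) requestCount, endpointUrl "
    "from requests where endpointUrl is not null and platform='web' "
    "group by endpointUrl order by appCount desc limit 101",
)
print(f"{seconds * 1000:.1f}ms for {len(rows)} rows")
```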
Testing a few other queries from our homepage.

```sql
select count(1) count, endpointUrl from requests where endpointUrl is not null
group by endpointUrl order by count desc limit 101;
```

SQLite: 263ms, Postgres: 964ms

```sql
select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
where host like '%google-analytics%'
order by length(content) + length(path) + length(headers) + length(cookies) desc limit 101;
```

SQLite: 1s 17ms, Postgres: 1s 41ms

```sql
select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
where initiator like 'com.airbnb.android@%'
limit 101;
```

SQLite: 130ms, Postgres: 113ms

```sql
-- SQLite
select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
WHERE content like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or path like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or headers like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or cookies like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%'
limit 101 collate nocase;

-- Postgres
select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
WHERE content like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or path ilike '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or headers ilike '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or cookies ilike '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%'
limit 101;
```

SQLite: 723ms, Postgres: 153ms

```sql
select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
where runType = 'consent-dialog-accepted' and initiator is not null and not exists (
  select * from requests r where r.initiator = requests.initiator and r.endpointUrl = requests.endpointUrl and r.runType = 'no-interaction'
)
limit 10;
```

SQLite: 1m 15s, Postgres: 677ms (!)

```sql
select length(content) + length(path) + length(headers) + length(cookies) as length, dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies from requests
order by length desc
limit 10;
```

SQLite: 1m 31s, Postgres: 7s (!)
This is quite interesting. For the quick queries, SQLite can sometimes even win against Postgres. But some major queries that take ages in SQLite are lightning-fast in Postgres.
But anyway, if there ever is support for Postgres in Datasette, it's quite a while away. The reason I decided to look into this again is that I happened to stumble across DuckDB again a few days ago. And that sounds very relevant to our use case (quoting from https://duckdb.org/why_duckdb#fast):
And notably, not only is Datasette support for DuckDB more likely than Postgres (simonw/datasette#968), there is even a plugin that enables support right now, since DuckDB's API is quite similar to SQLite's: https://github.com/cldellow/datasette-parquet

And getting data from SQLite into DuckDB is trivial: https://duckdb.org/docs/guides/database_integration/sqlite.html
Importing the data into DuckDB was indeed easy enough:
First observation: The DuckDB database only weighs 4.4 GB.
DataGrip also has basic support for DuckDB. Using:
Huh. I keep running into internal errors. For example:
Am I doing something wrong, or did the import maybe fail?
Seems like it was genuinely a problem with the import. I have now tried to adapt the […]. But at least with this database, all queries work fine. And it definitely seems worth investigating further: all queries I've tried running have returned almost instantly.
Okay, it was just a silly typo/copy-and-paste error. I had executed the regex on the wrong column ([…]).

Before I spotted the problem, I tried registering a custom function:

```js
db.register_udf('url_host', 'text', (url) => {
    try {
        const u = new URL(url);
        return u.hostname;
    } catch {
        return null;
    }
});
```

Even with needing to parse all those URLs using JS, creating the DB took less than 90s.
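For comparison, the same host extraction is also easy on the Python side, should we ever preprocess there instead. A sketch using only the standard library, mirroring the JS `url_host` UDF above (the function name is just kept from the UDF; nothing in our pipeline uses this yet):

```python
from urllib.parse import urlsplit


def url_host(url):
    """Return the hostname of a URL, or None if it cannot be determined.

    Mirrors the JS UDF: `new URL(url).hostname`, falling back to null.
    """
    if url is None:
        return None
    try:
        # `hostname` is lowercased and excludes any port.
        return urlsplit(url).hostname
    except ValueError:
        return None
```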
And the performance is still really encouraging! Even fetching the top 1000 endpoints is instantaneous: (Screencast: Screencast.from.2024-10-22.13-24-29.mp4)
I did some testing with datasette-parquet as well. One note: You'll need to also install […].

It's really fun to have the complex queries be instant. For example, the "most common endpoints on web" query is "Queries took 80.526ms" vs "Queries took 21597.821ms". Tiny bit of a difference there.

Unfortunately, I'm also running into a few errors. Some of these I've been able to fix with query changes etc. But this is definitely not a drop-in replacement.
I'm definitely also seeing that DuckDB isn't better in all scenarios. For any sort of aggregation etc., it blows SQLite out of the water. But if you actually need to fetch the data from the rows, SQLite tends to be faster. Not much of a surprise, given DuckDB's focus on analytical queries. Here's an example from the query we're using in TrackHAR: When fetching all the data, the difference isn't too bad. But when fetching a subset of rows, there is quite the difference.
I think that's as far as it makes sense for me to go with this investigation for now.
What do you think, @zner0L? Should we spend more time on investigating DuckDB and trying to make it work for us? Or should we just do what we had already resigned ourselves to anyway (throw away everything other than the top 1k for web for now) and hope that Datasette one day offers native support for another DBMS?
For reference: My changes for testing datasette-parquet are in #7.
Oh, I also wanted to mention this blog post I found: https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/

Most notably, it mentions that SQLite 3.38.0 ships with speed-ups for large analytical queries. We're currently running 3.37.2 on the server.
Alternative URL for checking the version: https://data.tweasel.org/-/versions
3.37.2 appears to be the version included with the system Python on our server (https://stackoverflow.com/a/68589898). As far as I can tell, there is no other sqlite installed on the system or via pip.
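One way to check which SQLite library a given Python links against is via the standard library itself (note that this is the version of the linked SQLite C library, which is what matters here, not the `sqlite3` module's own version number); a quick sketch:

```python
import sqlite3

# Version of the linked SQLite C library.
print(sqlite3.sqlite_version)       # e.g. '3.37.2' on our server
print(sqlite3.sqlite_version_info)  # tuple form, handy for comparisons

# The analytical-query speed-ups mentioned in the blog post landed in 3.38.0:
print(sqlite3.sqlite_version_info >= (3, 38, 0))
```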
I think we should focus on other things for now and go through with the deletion of data. We should maybe upload a dump to Zenodo so that other people can still use it on their own machines if they want to. We could still try to update SQLite and see if that brings any improvement, though.
At least for building tracker-wiki, we do need aggregate queries after all.
Notes on how I trimmed the dataset are in tweaselORG/experiments#3 (comment).

Full dataset for the top 10k on Zenodo: https://zenodo.org/records/13990110
With that, the result of […].
Running TrackHAR's […]. But of course, it also returns a lot less data. :/ We now have 8,488 matching requests instead of ~75k.
Deployed to data.tweasel.org, which is now usable again.
And debugging the GA adapter against the remote Datasette instance only takes 2 mins!
Now that I have imported the data from our first traffic collection on the web (tweaselORG/experiments#3), the Datasette instance has gotten a lot more sluggish.
That shouldn't come as too much of a surprise. We have jumped from 323,116 rows in the `requests` table to 2,614,200, and the database has grown to more than 10 GB (from a little over 1 GB, iirc).

Probably most critically, I have had the `debug-adapter` tool in TrackHAR fail due to timeouts quite regularly since I imported the data, and that is after I had already increased the timeout to 50s (08bbf19#diff-92a81eca627ad137f98388e8aeae273e0a4069fde8995897ab138b5fee849cdcR3).