Explore HDBSCAN as a replacement for DBSCAN in RGDR #136
Comments
HDBSCAN vs DBSCAN
Cool! I stumbled upon this method a long time ago and thought I should revisit it, but I completely forgot!
Okay guys, I have spent some time on this. HDBSCAN is in principle an improvement over DBSCAN, but I'm not sure yet whether it is a real improvement for us. I'll give some explanation here, and I can also give a presentation showing a notebook soon.

The main improvement of HDBSCAN over DBSCAN is that it does not use a single lambda parameter (the eps parameter) to determine the number of clusters. Instead, it maximizes the total persistence of the chosen clusters under the constraint that they do not overlap. A bit less formally: it checks whether splitting one cluster into two results in more 'mass' than before. If it does, it splits the cluster; if it doesn't, it keeps it as one. That way, it determines the lambda values itself.

As promised, the only parameter that needs tuning is the minimum cluster size. It is intuitive to use, because you can indicate that you only want clusters with a size > 5 cells. This is arguably better than the eps_km parameter, because eps_km requires the user to have some idea about the size of the data. However, although the parameter is easy to use, it can also lead to 'cutoff' scenarios where all regions found are smaller than 5 cells, so setting minimum_cluster_size to the (default) 5 leads to no regions being found at all.

So does it lead to more robust clusters? I don't know, to be honest. I also tested the speed in the notebook, and HDBSCAN does not look much faster than DBSCAN. It was actually slower in my case.

We (@semvijverberg and @geek-yang) discussed this a bit already. One way to proceed could be to use HDBSCAN with minimum_cluster_size set to 2 (the lowest setting) and then use @BSchilperoort's extra layer of removing areas based on min_area_km2. Maybe we could also look at the correlation of the timeseries between regions, like @semvijverberg has suggested.
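To make the "split only if it gives more mass" rule above a bit more concrete, here is a minimal sketch in plain Python. It is not the HDBSCAN implementation: the tree, the node names and the stability (persistence) values are made up for illustration, and the real library computes stabilities from its condensed tree and, as far as I know, does not select the root cluster by default.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A candidate cluster in a (hypothetical) cluster hierarchy."""
    name: str
    stability: float  # persistence of this candidate cluster (made-up values)
    children: list = field(default_factory=list)


def select_clusters(node):
    """Return (total_stability, chosen_names) for the subtree under `node`.

    A parent is kept as one cluster only if its own stability is at least
    the summed stability of the best selection among its descendants.
    Otherwise the split is accepted and the descendants are kept.
    """
    if not node.children:
        return node.stability, [node.name]

    child_total = 0.0
    child_names = []
    for child in node.children:
        total, names = select_clusters(child)
        child_total += total
        child_names.extend(names)

    if node.stability >= child_total:
        return node.stability, [node.name]  # keep as one cluster
    return child_total, child_names  # splitting gives more 'mass'


# Toy hierarchy: A is more persistent than its two halves, B is not.
tree = Node("root", 1.0, [
    Node("A", 5.0, [Node("A1", 1.5), Node("A2", 2.0)]),
    Node("B", 2.0, [Node("B1", 3.0), Node("B2", 1.5)]),
])

total, chosen = select_clusters(tree)
print(total, chosen)
```

In this toy tree, A stays whole (5.0 beats 1.5 + 2.0) while B is split (2.0 loses to 3.0 + 1.5), so the chosen clusters never overlap and no eps-like threshold is needed.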
Thanks for exploring this, Jannes! I have some questions!
Just saw your post. We discussed the results last Wednesday.
Jannes tested it on a larger dataset with higher resolution, but the results are similar to those with the coarse data.
@jannesvaningen Can you comment on it?
The clusters are similar in general, though some details are different. But as with DBSCAN, the results are not very robust, especially for the edge points. These methods are designed to cluster data based on density, which in practice means differences in the distances between points. However, since our data is on a structured grid, where the spacing between points is regular, this is difficult in some cases. We might be able to get robust results with unevenly distributed data, I guess. Actually, in ocean modelling the data is always on an unstructured grid. Maybe we can test our methods using some oceanic reanalysis data, e.g. ORAS5 or SODA3. Anyway, I think HDBSCAN is a nice option to add; at the least we provide an alternative for the user.
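A rough way to see the structured-grid problem described above is to look at nearest-neighbour distances on a regular patch of cells. The sketch below is only an illustration under simplifying assumptions: a flat 7x7 grid with unit spacing stands in for a patch of selected grid cells, and the distance to the k-th nearest neighbour (with k = 4 chosen arbitrarily) is used as a simple stand-in for local density. It does not reproduce how RGDR or HDBSCAN compute distances.

```python
import math


def core_distance(point, points, k):
    """Distance from `point` to its k-th nearest neighbour, excluding itself."""
    dists = sorted(math.dist(point, other) for other in points if other != point)
    return dists[k - 1]


# Hypothetical patch of grid cells on a regular grid with unit spacing.
patch = [(float(x), float(y)) for x in range(7) for y in range(7)]

# Cells well inside the patch all see exactly the same neighbourhood.
interior = [p for p in patch if 2 <= p[0] <= 4 and 2 <= p[1] <= 4]
interior_core = {round(core_distance(p, patch, k=4), 6) for p in interior}

# A cell on the corner of the patch has fewer close neighbours.
corner_core = round(core_distance((0.0, 0.0), patch, k=4), 6)

print(interior_core, corner_core)
```

Every interior cell gets the same value, so there is no density contrast inside the patch for the algorithm to work with, and the only cells that look different are the ones on the boundary, which fits the observation that the edge points are the least robust.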
I recently stumbled upon the alternative clustering method HDBSCAN. They promise the following: And also:
Not only this, but it seems to be basically a drop-in replacement for DBSCAN, which we currently use, so this could be quite interesting to explore to make RGDR more robust as well as perform better.