
GridSearch for Walking and Sampling Strategies #107

Open
ChrisDelClea opened this issue Jul 21, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@ChrisDelClea

🚀 Feature

Hi guys, first of all, thanks for this amazing package, I love it.
While doing some work with it, I asked myself what the best walking/sampling strategy would be.
To evaluate that, two things are missing:

  • first, a way to evaluate the goodness of the embeddings, i.e. whether they really capture the semantics of the graph, and
  • second, a GridSearch-like class to run all of the different options.

Solution

A GridSearch class à la sklearn would be awesome. Is that possible?

ChrisDelClea added the enhancement (New feature or request) label on Jul 21, 2022
@GillesVandewiele
Collaborator

Hi Chris,

Thank you for your suggestion; I agree that it would be a nice addition! However, it might be really difficult/expensive to run, especially since some of these strategies have their own hyper-parameters as well. The hyper-parameters of the embedding techniques (e.g. Word2Vec) might also change per strategy (and even those of the downstream ML model), so the number of possible combinations of walking + sampling strategies quickly grows.
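To make that concern concrete, here is a purely hypothetical sketch of what such a search loop could look like (nothing like this exists in the package yet; `RandomWalker`/`WLWalker` and their `(depth, max_walks)` constructors follow the pyRDF2Vec docs, while `score_fn` stands in for the open question of how to score an embedding):

```python
from itertools import product

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.walkers import RandomWalker, WLWalker

def grid_search(kg, entities, score_fn, depths=(2, 4), max_walks=(50, 100)):
    """Fit one transformer per configuration and keep the best-scoring one."""
    best_cfg, best_score = None, float("-inf")
    for walker_cls, depth, n_walks in product((RandomWalker, WLWalker), depths, max_walks):
        transformer = RDF2VecTransformer(walkers=[walker_cls(depth, n_walks)])
        # Recent pyRDF2Vec versions return (embeddings, literals) here.
        embeddings, _ = transformer.fit_transform(kg, entities)
        score = score_fn(embeddings)  # scoring an embedding is the hard open question
        if score > best_score:
            best_cfg, best_score = (walker_cls.__name__, depth, n_walks), score
    return best_cfg, best_score
```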

@ChrisDelClea
Author

I see, thanks for that.

I have a few other questions/remarks/wishes:

  • Could you also implement or integrate other, non-Word2Vec-based KG embedding algorithms in this framework? I know this request might sound strange at first, but I could not find any framework that takes an RDF graph as input to generate embeddings for e.g. RotatE, and I love your API.
  • I also wrote a function to get the k most similar entities out of the graph given an input entity. It would be nice to have something built in: `transformer.most_similar(e1) => e2, e3, e4`
  • Automatic plotting of the embeddings with TensorBoard or something similar out of the box would be really great, too.
  • In the blog post you wrote that numerical data is an issue; do you have any suggestions how to fix that? I am asking since I am trying to embed a KG with a lot of numerical data, e.g. age, zip code, etc.
  • Another question regarding your last point: "RDF2Vec cannot deal with volatile data". In machine learning we normally want to embed an unseen data point in the vector space and find the most similar entities. So I was wondering what would happen if I call `walk_embeddings = transformer.transform(unseen_entity)`?
  • The blog post was super helpful for understanding the package. Just some things no longer seem up to date, e.g. `skip_predicates` vs. `scip_predicates`. I was also wondering whether I have to list all `literals` as parameters?

Best regards
Chris

@GillesVandewiele
Collaborator


* Could you also implement or integrate other, non-Word2Vec-based KG embedding algorithms in this framework? I know this request might sound strange at first, but I could not find any framework that takes an RDF graph as input to generate embeddings for e.g. RotatE, and I love your API.

Yes, we tried to support that through the Embedder interface. You can implement your own embedding model there (see e.g. the fastText implementation we provide).
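As a minimal sketch, assuming the `Embedder` interface exposes `fit`/`transform` as in the bundled implementations (the model internals below are placeholders):

```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Embedder

class MyEmbedder(Embedder):
    """Placeholder for any custom model trained on the extracted walks."""

    def fit(self, corpus, is_update=False):
        # Train your own model on the walks (lists of tokens) here.
        self._model = {}  # placeholder for the trained model
        return self

    def transform(self, entities):
        # Return one embedding vector per requested entity.
        return [self._model.get(entity) for entity in entities]

transformer = RDF2VecTransformer(MyEmbedder())
```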

* I also wrote a function to get the k most similar entities out of the graph given an input entity. It would be nice to have something built in: `transformer.most_similar(e1) => e2, e3, e4`

I agree that this would indeed be nice! However, how would one define most_similar? Cosine distance comes to mind as a metric, but perhaps other people would want other metrics, so this should be made configurable. Nevertheless, it is an interesting suggestion!
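For illustration, a self-contained sketch with a configurable metric (`most_similar` is hypothetical, not an existing API; `embeddings` is assumed to map each entity to its vector):

```python
import numpy as np

def most_similar(entity, embeddings, k=3, metric="cosine"):
    """Return the k entities whose vectors are closest to `entity`."""
    query = np.asarray(embeddings[entity], dtype=float)
    scores = {}
    for other, vec in embeddings.items():
        if other == entity:
            continue
        vec = np.asarray(vec, dtype=float)
        if metric == "cosine":
            scores[other] = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        else:  # "euclidean": negate the distance so that larger means closer
            scores[other] = -np.linalg.norm(query - vec)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```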

* Automatic plotting of the embeddings with TensorBoard or something similar out of the box would be really great, too.

Do you mean plotting during the training procedure? That should indeed be possible with gensim models.
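For the export side, a minimal sketch that writes the two TSV files the TensorBoard Embedding Projector expects (the `vectors.tsv`/`metadata.tsv` file names follow the Projector's convention; `to_tensorboard` itself is hypothetical):

```python
import os

def to_tensorboard(entities, embeddings, out_dir="."):
    """Dump embeddings in the TSV format the TensorBoard Projector loads."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "vectors.tsv"), "w") as vec_f, \
         open(os.path.join(out_dir, "metadata.tsv"), "w") as meta_f:
        for entity, vector in zip(entities, embeddings):
            vec_f.write("\t".join(str(x) for x in vector) + "\n")
            meta_f.write(f"{entity}\n")
```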

* In the blog post you wrote that numerical data is an issue; do you have any suggestions how to fix that? I am asking since I am trying to embed a KG with a lot of numerical data, e.g. age, zip code, etc.

We don't have an optimal fix yet. However, you can specify the paths along which to extract this numerical information from the KG. You could then append this numerical data to your embeddings before fitting a downstream ML model on it (or take these numbers into account for your similarity search as well).
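A short sketch of that flow; the `literals` argument on `KG` and the `(embeddings, literals)` return value of `fit_transform` follow the pyRDF2Vec documentation, while the file path and predicate URI are placeholders:

```python
import numpy as np
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG

# Placeholder graph and predicate; `literals` lists the predicate path(s)
# along which to collect values such as age.
kg = KG("graph.ttl", literals=[["http://example.org/age"]])
entities = ["http://example.org/Alice", "http://example.org/Bob"]

transformer = RDF2VecTransformer()
embeddings, literals = transformer.fit_transform(kg, entities)

# Concatenate the vectors with the extracted literals into one feature
# matrix (assuming here that all literal values are numeric).
X = np.hstack([np.asarray(embeddings), np.asarray(literals, dtype=float)])
```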

* Another question regarding your last point: `RDF2Vec cannot deal with volatile data`. In machine learning we normally want to embed an unseen data point in the vector space and find the most similar entities. So I was wondering what would happen if I call `walk_embeddings = transformer.transform(unseen_entity)`?

A mechanism for updating the model has been implemented: if you pass is_update=True to the fit method, it should update the model without re-training it entirely from scratch. Nevertheless, RDF2Vec is inherently a transductive (i.e. non-inductive) technique, and these mechanisms are far from an optimal solution.

* The blog post was super helpful for understanding the package. Just some things no longer seem up to date, e.g. `skip_predicates` vs. `scip_predicates`. I was also wondering whether I have to list all `literals` as parameters?

No, these are optional, but they might be useful to extract numerical information (per your comment above).


@ChrisDelClea
Author

  • Yes, cosine or Euclidean would be great distance measures to start with.

  • Other frameworks directly offer a method (e.g. `embeddings.to_tensorboard()`) that exports the embeddings and related data in TensorBoard format. A similar function would of course be helpful here, too!

  • I still don't fully understand what the `literals` parameter is helpful for or when to use it. Could you explain it a bit more?

  • Could you provide a full example of how to use `is_update=True`? I wonder what input I'd have to pass: all the data or only the novel entities?

@GillesVandewiele
Collaborator

GillesVandewiele commented Nov 16, 2022

The feature requests are noted. I'll take a look if I ever find the bandwidth :), and any PRs are of course more than welcome as well.

Literals are needed to use numerical information in your ML model. Word2Vec internally uses a BoW representation for its tokens; "9" and "10" are, in other words, COMPLETELY different tokens, and the numerical relation between them (the fact that they are actually quite close) is lost. As such, you can get embeddings and literals together, concatenate them, and feed the result to your ML model. Example here

Here's an example of the `is_update` mechanism.
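In outline, the flow looks roughly like this (placeholder graph and URIs; the exact `fit`/`transform` signatures may differ between pyRDF2Vec versions):

```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG

kg = KG("graph.ttl")  # placeholder graph
transformer = RDF2VecTransformer()

# Initial training on the entities known up front.
transformer.fit(kg, ["http://example.org/Alice", "http://example.org/Bob"])

# Later: pass the novel entities with is_update=True to update the
# existing model instead of retraining it from scratch (whether old
# entities must be repeated may depend on the version).
transformer.fit(kg, ["http://example.org/Carol"], is_update=True)
embeddings = transformer.transform(kg, ["http://example.org/Carol"])
```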
