Türkçe GloVe - Repository for Turkish GloVe Word Embeddings
We used the official GloVe repository both to create the word embeddings and to evaluate them. GloVe Github Repository
- 570K vocab, cased, 300d vectors, 1.6 GB text, 2.6 GB binary link
- 253K vocab, uncased, 300d vectors, 720 MB text, 1.2 GB binary link
The corpus was collected from the January-December 2018 Common Crawl dumps. It contains 2.736B tokens (corpus size: 5.4 GB).
Corpus Link
Paper Link
This benchmark dataset is used for intrinsic evaluation on the analogy task; we used the synonym, capital, and antonym categories. Benchmark Dataset Link
| Semantic Evaluation | Antonyms Analogy Task | Capitals Analogy Task | Synonyms Analogy Task | Total Accuracy |
| --- | --- | --- | --- | --- |
| GloVe Uncased | 21.70 | 47.74 | 19.48 | 27.88 |
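The analogy task scored above can be sketched with the standard 3CosAdd rule: answer "a is to b as c is to ?" by taking the word whose vector maximizes cosine similarity to b - a + c, excluding a, b, and c. A minimal numpy sketch; the toy 2d vectors below are illustrative stand-ins for the real 300d embeddings, not values from the released model.

```python
import numpy as np

# Toy 2d embeddings standing in for the real 300d Turkish GloVe vectors.
emb = {
    "almanya": np.array([1.0, 0.0]),
    "berlin":  np.array([1.0, 1.0]),
    "fransa":  np.array([0.0, 0.0]),
    "paris":   np.array([0.0, 1.0]),
}

def analogy(a, b, c):
    """3CosAdd: return the word closest to b - a + c (a, b, c excluded)."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -2.0
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v) + 1e-9)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("almanya", "berlin", "fransa"))  # capital-of analogy -> "paris"
```

Accuracy on a category (e.g. capitals) is simply the fraction of such questions where the top-ranked word matches the expected answer.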
This dataset is used for extrinsic evaluation on text categorization. The dataset has 7 different classes.
|  | SVC | Logistic Regression |
| --- | --- | --- |
| GloVe Cased | 0.89306 | 0.89959 |
| GloVe Uncased | 0.89956 | 0.90530 |
|  | SVC | Logistic Regression |
| --- | --- | --- |
| GloVe Cased | 0.89388 | 0.89864 |
| GloVe Uncased | 0.90015 | 0.90619 |
|  | SVC | Logistic Regression |
| --- | --- | --- |
| GloVe Cased | 0.89306 | 0.89796 |
| GloVe Uncased | 0.89959 | 0.90531 |
We used these machine learning classifiers with scikit-learn's default hyperparameters.
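A minimal sketch of this extrinsic evaluation setup: represent each document as the average of its word vectors, then train SVC and LogisticRegression with scikit-learn defaults. The random stand-in vocabulary and tiny two-class toy corpus below are assumptions for illustration; the real setup uses the released 300d Turkish vectors and the 7-class dataset linked below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Random 300d vectors as a stand-in for the real Turkish GloVe model.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=300) for w in ["kedi", "köpek", "fransa", "berlin"]}

def doc_vector(tokens, dim=300):
    # Average the embeddings of in-vocabulary tokens; zeros if none match.
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy corpus standing in for the 7-class text categorization dataset.
docs = [["kedi", "köpek"], ["fransa", "berlin"]] * 20
labels = [0, 1] * 20
X = np.stack([doc_vector(d) for d in docs])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

# Default hyperparameters, as in the evaluation above.
for clf in (SVC(), LogisticRegression()):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, clf.score(X_te, y_te))
```

The reported numbers above are the scores returned by this kind of `clf.score` evaluation on the real dataset.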
Text Categorization Dataset Link
Example queries (with a gensim-style `most_similar` API):

```python
model.most_similar(positive=['fransa', 'berlin'], negative=['almanya'])  # capital-of analogy
model.most_similar(positive=['geliyor', 'gitmek'], negative=['gelmek'])  # verb-inflection analogy
model.most_similar("kedi")  # nearest neighbours of "kedi" (cat)
```
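The queries above assume a loaded model object; a dependency-free sketch of the same lookup, assuming the released text files use the standard GloVe line format `word v1 v2 ... v300` (the three-word, 3d sample here is an inline stand-in for the real 1.6 GB file):

```python
import io
import numpy as np

# Inline stand-in for the released GloVe text file ("word v1 ... v300" lines).
sample = "kedi 0.1 0.2 0.3\nköpek 0.1 0.2 0.25\nfransa 0.9 0.1 0.0\n"

vectors = {}
for line in io.StringIO(sample):
    word, *vals = line.rstrip("\n").split(" ")
    vectors[word] = np.array([float(v) for v in vals])

def most_similar(word, topn=1):
    # Plain cosine nearest-neighbour lookup over the loaded vocabulary.
    q = vectors[word]
    sims = {
        w: float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for w, v in vectors.items() if w != word
    }
    return sorted(sims.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar("kedi"))  # most similar word is "köpek"
```

For real use, the same text files can be loaded with an off-the-shelf reader such as gensim's `KeyedVectors` instead of this hand-rolled loop.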
- https://cs224d.stanford.edu/lecture_notes/notes2.pdf
- https://nlp.stanford.edu/pubs/glove.pdf