TypeError: expected np.ndarray (got Tensor) #1431

Open
topl0305 opened this issue Nov 6, 2024 · 3 comments

topl0305 commented Nov 6, 2024

Describe the bug
I was trying to use the pretrained model https://huggingface.co/stanfordnlp/stanza-lt
With a lot of issues, like stanza.download("lt") constantly crashing, I was forced to do it manually. So I installed and downloaded everything, and the following code reproduces the bug:

import stanza

config = {
    'processors': 'tokenize,pos',
    'lang': 'lt',
    'tokenize_model_path': './stanza_resources/lt/tokenize/alksnis.pt',
    'pos_model_path': './stanza_resources/lt/pos/alksnis_nocharlm.pt',
    'pos_pretrain_path': './stanza_resources/lt/pretrain/fasttextwiki.pt',
    'tokenize_pretokenized': True,
    'download_method': None
}

nlp = stanza.Pipeline(**config)  # initialize neural pipeline
doc = nlp("Kur einam mes su Knysliuku, didžiulė paslaptis")  # run annotation over a sentence
print(doc)

Expected behavior
The result should be obvious:

[
[
{
"id": 1,
"text": "Kur",
"upos": "ADV",
"xpos": "prm.l.lrgin.",
"feats": "Degree=Pos|PronType=Int,Rel",
"misc": "",
"start_char": 0,
"end_char": 3
},
...
]

Environment (please complete the following information):

  • OS: Windows 10
  • Python 3.10.5
  • stanza 1.9.2
  • numpy 2.1.2

Additional context
At least it works after patching stanza/models/pos/model.py. Around line 90, I changed
self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(torch.from_numpy(emb_matrix), freeze=True))
to

if isinstance(emb_matrix, torch.Tensor):
    self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(emb_matrix, freeze=True))
else:
    self.add_unsaved_module('pretrained_emb', nn.Embedding.from_pretrained(torch.from_numpy(emb_matrix), freeze=True))

Not sure which is the culprit: the library or the model.

topl0305 added the bug label Nov 6, 2024
@AngledLuffa
Collaborator

Ultimately, the problem here is that we modified the models for the upcoming version 1.10, and you're downloading the new models with the old code. You could use the dev branch, or download the version 1.9 models directly from HF if you're sure you need to do it manually.
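To make the mismatch concrete, here is a minimal sketch of a check one could run before loading manually downloaded files. The helper names are hypothetical, and the rule that model files and library code should share the same major.minor series is an assumption made for illustration, not a documented stanza guarantee.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def release_series(version_string):
    """Return (major, minor) as integers, e.g. "1.9.2" -> (1, 9).

    Assumes a plain dotted version; strings such as "1.10rc1" are not handled.
    """
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def same_release_series(code_version, model_version):
    """Assumed rule: code and models are compatible within one major.minor."""
    return release_series(code_version) == release_series(model_version)
```

For example, `same_release_series("1.9.2", "1.10.0")` is False, which matches the situation described here: 1.9.2 code with models built for 1.10. `installed_version("stanza")` would supply the first argument on a machine where stanza is installed.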

With a lot of issues, like stanza.download("lt") constantly crashing, I was forced to do it manually.

"crashing" how? like with a bad connection? it doesn't "crash" when i run it

You also don't need to do any of that.

Just run

nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True)

It should automatically download just the models you need for the right version.

@topl0305
Author

topl0305 commented Nov 8, 2024

If I use the direct download, stanza.download('lt'), I get the following error:

Traceback (most recent call last):
  File "C:/Users/***/Desktop/test_nlp.py", line 2, in <module>
    stanza.download('lt') # download English model
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 599, in download
    request_file(
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 159, in request_file
    assert_file_exists(path, md5, alternate_md5)
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 112, in assert_file_exists
    raise ValueError("md5 for %s is %s, expected %s" % (path, file_md5, md5))
ValueError: md5 for C:\Users\***\stanza_resources\lt\default.zip is 36e9cd4989fac42001d585dc514c2020, expected 3b1725c28eeed0cdf734bd92ec82f927

This is the log file:
log.txt

I also tested your suggestion, nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True), and got this:

Traceback (most recent call last):
  File "C:/Users/***/Desktop/test_nlp.py", line 7, in <module>
    nlp = stanza.Pipeline("lt", processors="tokenize,pos", tokenize_pretokenized=True)
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\pipeline\core.py", line 252, in __init__
    download_models(download_list,
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 540, in download_models
    request_file(
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 159, in request_file
    assert_file_exists(path, md5, alternate_md5)
  File "C:\Users\***\AppData\Local\Programs\Python\Python310\lib\site-packages\stanza\resources\common.py", line 112, in assert_file_exists
    raise ValueError("md5 for %s is %s, expected %s" % (path, file_md5, md5))
ValueError: md5 for C:\Users\***\stanza_resources\lt\pretrain\fasttextwiki.pt is 6996c18339716076308d957354340a61, expected 89420a04d9c0b31feb5598e17eb52f8f

@AngledLuffa
Collaborator

That's pretty weird. If I use the github repo main branch (which is 1.9.2), download successfully downloads a file with the following md5sum, which is the expected value:

[john@localhost stanza]$ md5sum /home/john/stanza_resources/lt/default.zip
3b1725c28eeed0cdf734bd92ec82f927  /home/john/stanza_resources/lt/default.zip

I can switch branches back & forth between main & dev, and it overwrites the old models when trying to download again. At no point does it download a model with md5sum 36e9cd4989fac42001d585dc514c2020. This works on both Linux and Windows.

Is it possible the download was interrupted and it got a corrupted file?

At any rate, I suggest deleting those incorrect files and trying again.
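One way to do that cleanup is sketched below, using only the Python standard library. The helper names `md5_of_file` and `remove_if_corrupt` are hypothetical and not part of stanza; the expected md5 values would come from the error messages above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_if_corrupt(path, expected_md5):
    """Delete the file if its md5 differs from expected_md5.

    Returns True if the file was deleted, False if it was missing or intact.
    """
    path = Path(path)
    if not path.is_file():
        return False
    if md5_of_file(path) == expected_md5:
        return False
    path.unlink()
    return True


# Example call (assumes the resources directory sits under the home
# directory, as in the tracebacks above):
# remove_if_corrupt(Path.home() / "stanza_resources" / "lt" / "default.zip",
#                   "3b1725c28eeed0cdf734bd92ec82f927")
```

After the mismatched files are removed, rerunning the download should fetch fresh copies.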
