rules-applier segmentation fault, stalling #17

Open

MemduhG opened this issue Apr 19, 2019 · 17 comments
@MemduhG
Collaborator

MemduhG commented Apr 19, 2019

I am trying to train the ambiguous system on Ubuntu 18.04, working in the Kyrgyz to Turkish (kir->tur) direction. I am using Wikipedia dump files and following along with the instructions.

When I first ran rules-applier, it output just a 0 and had not stopped after 4-5 hours, at which point I cancelled it.

'/home/memduh/git/apertium-ambiguous/src'/rules-applier 'ky_KG' '../../apertium-tur-kir.kir-tur.t1x' sentences.txt lextor.txt rulesOut.txt
0

Upon running it again, it began to produce output like:

161150
ruleId=0
tokId=0 , out = ^unknown<unknown>{^*Новосибир$}$
tokId=1 , out = ^default<default>{^milli<adj>$}$
tokId=2 , out = ^default<default>{^üniversite<n><px3sp><acc>$}$
tokId=3 , out = ^unknown<unknown>{^*Новосибирск$}$
tokId=4 , out = ^unknown<unknown>{^*ш$}$
tokId=5 , out = ^default<default>{^.<sent>$}$


tokId = 0 : *Новосибир
ruleId = 0; patNum = 1

tokId = 1 : milli
ruleId = 0; patNum = 1

tokId = 2 : üniversite
ruleId = 0; patNum = 1

tokId = 3 : *Новосибирск
ruleId = 0; patNum = 1

tokId = 4 : *ш
ruleId = 0; patNum = 1

tokId = 5 : .
ruleId = 0; patNum = 1

tok=0; rul=0; pat=1 - tok=1; rul=0; pat=1 - tok=2; rul=0; pat=1 - tok=3; rul=0; pat=1 - tok=4; rul=0; pat=1 - tok=5; rul=0; pat=1 - 

However, when I cancelled this and ran it again, it started giving me segmentation faults, like below:

'/home/memduh/git/apertium-ambiguous/src'/rules-applier 'ky_KG' '../../apertium-tur-kir.kir-tur.t1x' sentences.txt lextor.txt rulesOut.txt
0
Makefile:48: recipe for target 'rulesOut.txt' failed
make: *** [rulesOut.txt] Segmentation fault (core dumped)
make: *** Deleting file 'rulesOut.txt'

After waiting for a while and trying again, it now seems to be back in the first situation, stalling with no output. Should it be producing and writing out the output as it goes, or is there some long training period before it does this?

@aboelhamd
Collaborator

Hi Memduh,

I think you are using a copy where we didn't remove the debugging traces, sorry for that.

For the errors on Ubuntu 18, it gave me the same behaviour too, and I think it's because of the compiler version.
While debugging, it complained about some hash map accesses, and since we were working on other issues, I just worked on another machine with Ubuntu 16. I didn't try to downgrade or to solve the problem.
Also, for the results I see, where no rules are applied to any of the tokens, I think there is something wrong with the transfer file. Did you put an id attribute in the rule elements?

For the run time of rules-applier, I think it should take less than 10 minutes for a 1 MB file.
For the segmenter, did you use our script or run it on your own? There are some undesired characters that we also remove with the segmenter script.

@MemduhG
Collaborator Author

MemduhG commented Apr 19, 2019

FYI, my input file is about 114 megabytes. I used the Kazakh pragmatic segmenter for the Kyrgyz wiki (with the .rb script from your repo) and it seemed to work quite alright, though I am not sure whether there are any characters that are causing problems. The rules only have comments, not IDs; they are defined like <rule comment="regla: nom-noflex">. I should add IDs to them in the t1x file if that is necessary.
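For illustration only (the id value here is just a placeholder, not taken from the pair), a rule that currently reads <rule comment="regla: nom-noflex"> would become something like <rule id="nom-noflex" comment="regla: nom-noflex"> once an id is added. Assuming each opening <rule ...> tag sits on its own line, a rough way to count how many rules still lack an id is:

grep -c '<rule ' apertium-tur-kir.kir-tur.t1x            # total number of rule elements
grep -c '<rule[^>]* id="' apertium-tur-kir.kir-tur.t1x   # rule elements that already carry an id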

I tried again with a file containing the first thousand lines of the sentences, and after some attempts the system fairly quickly output results for the whole corpus I had originally tried (all 948096 lines), which makes me wonder whether the program is copying or caching the files it uses somewhere.

All the rules it chooses are "default", but I will try again once I assign IDs.

@sevilaybayatli
Owner

I don't know if the Kazakh segmenter (we prepared the Kazakh segmenter based only on Kazakh data) works well with Kyrgyz too. You should add rule IDs; it is necessary.

@aboelhamd
Collaborator

Yes, it's necessary to add IDs. There is a script that adds IDs to the rules here.

As for the caching question: the program doesn't cache or copy the files it uses. Honestly, I don't know why it outputs results for the original file; maybe you passed the sentences file with 1000 sentences but the lextor file with the 948096 sentences?

Theoretically, the program should work for very large files, but I prefer splitting the input into files of about 10 MB each; they are easier to debug and less prone to faults.
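A quick way to check whether the two input files really have the same number of sentences, and to do the suggested splitting (file names and the chunk size are only illustrative):

wc -l sentences.txt lextor.txt                   # both files should report the same number of lines
split -l 50000 -d sentences.txt sentences.part.  # cut the corpus into numbered chunks of 50000 sentences
split -l 50000 -d lextor.txt lextor.part.        # same -l value keeps the lextor chunks aligned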

If you can wait for an hour or so, I will provide you with another rules-applier file without the debugging traces.

@MemduhG
Collaborator Author

MemduhG commented Apr 19, 2019

Ah, I see. Yes, the makefile I wrote didn't regenerate all of that, of course. I will add IDs, start again with a more reasonable file size, and see what happens.

But when executing rules-applier, should it immediately begin to output the debug traces, or just wait for a long time? I guess it was only working correctly when it was outputting them?

@aboelhamd
Collaborator

Yes, your guess is correct.

@aboelhamd
Collaborator

You can use this file instead. And also use the yasmet-formatter file in the same repo.

They have one change besides removing the debug traces: the sentences file is no longer needed in the input, because it actually has no use.

Try again and keep us updated with your results.

@sevilaybayatli
Owner

sevilaybayatli commented Apr 20, 2019 via email

@sevilaybayatli
Owner

Regarding the rules-applier, it has been updated in this repository.

@MemduhG
Collaborator Author

MemduhG commented Apr 22, 2019

You can use this file instead. And also use the yasmet-formatter file in the same repo.

They have one change besides removing the debug traces: the sentences file is no longer needed in the input, because it actually has no use.

Try again and keep us updated with your results.

I added IDs and got segfaults very often:

'/home/memduh/git/apertium-ambiguous/src'/rules-applier 'ky_KG' '../../apertium-tur-kir.kir-tur.t1x' sentences.txt lextor.txt rulesOut.txt
0
Makefile:48: recipe for target 'rulesOut.txt' failed
make: *** [rulesOut.txt] Segmentation fault (core dumped)
make: *** Deleting file 'rulesOut.txt'

Trying it enough times makes it work eventually, but it seems completely random whether it will work or not. I think I am doing something wrong with the IDs. I have added them, as one can see in https://github.com/apertium/apertium-tur-kir/blob/master/apertium-tur-kir.kir-tur.t1x, but I keep getting the 0 ruleId issue, and the output file is full of "defaults", as you can see here: https://termbin.com/xjiv
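One way to pin down where the crash comes from, assuming gdb is installed, is to run the same command under the debugger and print a backtrace when it faults:

gdb --args '/home/memduh/git/apertium-ambiguous/src'/rules-applier 'ky_KG' '../../apertium-tur-kir.kir-tur.t1x' sentences.txt lextor.txt rulesOut.txt
(gdb) run
(gdb) bt    # prints the call stack at the point of the segmentation fault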

@sevilaybayatli
Owner

sevilaybayatli commented Apr 22, 2019 via email

@sevilaybayatli
Owner

sevilaybayatli commented Apr 22, 2019 via email

@MemduhG
Collaborator Author

MemduhG commented Apr 22, 2019 via email

@sevilaybayatli
Owner

It worked for me without any problem for both language pairs, kaz-tur and spa-eng. As I said, try running rules-applier in the same directory rather than using a path. By the way, rules-applier has also been updated in this repository.

And be sure you are using the right transfer file, the one with IDs.

@aboelhamd
Collaborator

Hi, Memduh.
I still think this arbitrary behaviour is because of the compiler version, since by now you have done everything right.
Maybe you should try downgrading the compiler to Ubuntu 16's version.
Today, God willing, I will try running it on some kir text; maybe there is a bug in generalising to other pairs.
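A minimal sketch of that downgrade on Ubuntu 18.04, assuming g++-5 (the Ubuntu 16.04 default) is available from the standard repositories and that the project's Makefile honours the CXX variable:

sudo apt install g++-5    # GCC/G++ 5 is the default toolchain on Ubuntu 16.04
make clean
make CXX=g++-5            # rebuild rules-applier with the older compiler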

@sevilaybayatli
Owner

sevilaybayatli commented Apr 22, 2019 via email

@aboelhamd
Collaborator

Hi @MemduhG, I tried running rules-applier on the same corpus you have, and I got "default" almost everywhere. In the next few days I will try to debug the code to find the problem. I will keep you updated.
