This repository is no longer under active development; it has been merged into its parent repository, MTEnglish2Odia.
This repository was forked from the MTEnglish2Odia repository. It contains an English (language code: en) to Odia/Oriya (language code: or) parallel corpus of phrases.
There are multiple files, organized by the source of the data.
Combined file: consolidated_full_corpus.txt
Sample structure:
<english phrase>||<odia phrase>
For example:
urban development planning||ସହରାଞ୍ଚଳ ବିକାଶ ଯୋଜନା
Family||ପରିବାର
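The sample structure above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_pairs` is hypothetical (not part of this repository), and it assumes every valid line holds one English phrase and one Odia phrase separated by `||`, skipping anything else.

```python
from typing import Iterable, Iterator, Tuple

SEPARATOR = "||"  # separator shown in the sample structure


def parse_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (english, odia) tuples; skip blank or malformed lines.

    Hypothetical helper for illustration only.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first '||' only; sep is '' when it is missing.
        english, sep, odia = line.partition(SEPARATOR)
        english, odia = english.strip(), odia.strip()
        if not sep or not english or not odia:
            continue
        yield english, odia


sample = [
    "urban development planning||ସହରାଞ୍ଚଳ ବିକାଶ ଯୋଜନା",
    "Family||ପରିବାର",
]
pairs = list(parse_pairs(sample))
for english, odia in pairs:
    print(english, "->", odia)
```

To read the combined file instead of the in-memory sample, pass an open file handle, for example `open("consolidated_full_corpus.txt", encoding="utf-8")`; UTF-8 is assumed here because the Odia text is Unicode.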
- 4500+ cleaned en-or parallel pairs (growing every weekend)
- ~50,000 uncleaned pairs
- Apertium Wiki for Odia language
- Indic Languages Multilingual Parallel Corpus
- Anuvadaksh - An existing online English-Odia translator
- Wordnet for Odia
- English-Punjabi parallel corpus creation
- Wikipedia Data dump
- Open Parallel Corpus
- OdiEnCorp 1.0
- TDIL - Technical strings, 52,000 pairs (data needs to be cleaned)
- Global Voices - 328 sentence pairs
- Mann ki baat - 1000+ pairs
- Twitter:DoctorBabu - Around 100 Botanical terms En-Or pairs
- Rupesh Ranjan Panda - Around 300 generic En-Or pairs
These are a few places where relevant data may be available; however, obtaining the data is not straightforward.
- EMILLE Project : The Oriya written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Institute of Applied Language Sciences, Bhubaneshwar (approximately 2,730,000 words).
- Gyan Nidhi-TDIL : A million-page multilingual parallel text corpus in English and 11 Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Marathi, Malayalam, Oriya, Punjabi, Tamil & Telugu), based on Unicode encoding. The Gyan Nidhi corpus contains text in the form of books. These books contained a number of diagrams, figures, charts and other special symbols, which were removed from the text using automated and manual tools. The text in Gyan Nidhi is in the form of paragraphs, which are converted into short sentences.
- You can send English-Odia word/phrase/sentence pairs in the format below, in a new file named after yourself and the type of data. For example, if your name is Satyabrata and you want to upload generic phrases:
Key | Example |
---|---|
Filename | satyabrata.txt |
File upload path | data/Individual_files/satyabrata.txt |
File text format | `Why are you so lazy? |
Please make sure you have the correct permissions to upload this data under the GPL license.
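Before opening a pull request, it may help to check a contribution file locally. The sketch below is not a tool from this repository: `find_problems` is a hypothetical helper, and it assumes contribution files use the same `<english phrase>||<odia phrase>` line format as the corpus. It also assumes the Odia side should contain at least one character from the Unicode Oriya block (U+0B00 to U+0B7F).

```python
from typing import Iterable, List, Tuple


def find_problems(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, message) for each line that looks malformed.

    Hypothetical helper for illustration only.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        parts = line.split("||")
        if len(parts) != 2:
            problems.append((number, "expected exactly one '||' separator"))
            continue
        english, odia = parts[0].strip(), parts[1].strip()
        if not english or not odia:
            problems.append((number, "empty English or Odia side"))
        elif not any("\u0B00" <= ch <= "\u0B7F" for ch in odia):
            # Assumption: a valid Odia phrase has Oriya-block characters.
            problems.append((number, "no Odia script characters on the right side"))
    return problems


example_lines = [
    "Family||ପରିବାର",   # well-formed
    "Family|ପରିବାର",    # single pipe instead of '||'
    "Family||family",   # right side is not in Odia script
]
for number, message in find_problems(example_lines):
    print(f"line {number}: {message}")
```

An empty result only means the file matches these assumed checks; it does not replace the review of the pull request.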
- A tutorial on how to fork a repository and send a PR can be found in this video or this video, or in this GitHub doc tutorial for forking and this one for pull requests.
- Your Pull Request will be reviewed first.
- Please follow up if any comments or modifications are needed on your Pull Request.
- In case of any confusion, please contact [email protected]. You will get a response within a day or two.