A simple machine- and human-readable complete collection of all emoji, with keywords in all available languages, kept current directly from Unicode releases.
This repository is updated directly from the Unicode Consortium's latest CLDR (Common Locale Data Repository), which Unicode provides for internationalization in general but which also happens to provide handy internationalized keywords for emoji.
It is maintained for anyone who needs easily-parsable emoji in one of several formats, and directly for use in my own Splatmoji, the Linux desktop emoji/emoticon picker.
Not much to it:
git clone https://github.com/cspeterson/splatmoji-emojidata.git
# Now you have the data! 😀
If you're just looking for the obvious English-language full emoji collections, those would be:
data/{json,tsv,yaml}/en.all.{json,tsv,yaml}: the full base set of emoji with English-language annotations.
This repo includes all of the annotated emoji per available language.
The emoji are sorted according to Unicode sorting rules.
The same data is made available in json, tsv, and yaml.
The files are named according to the source CLDR data files using standard language and locale identifiers, e.g. en_GB is "British English," de_CH is "Swiss Standard German," etc.
# A sampling of the data directories
├── data
│   ├── json
│   │   ├── [...]
│   │   ├── en.json
│   │   ├── en_AU.json
│   │   ├── en_CA.json
│   │   ├── en_GB.json
│   │   └── [...]
│   ├── tsv
│   │   ├── [...]
│   │   ├── en.tsv
│   │   ├── en_AU.tsv
│   │   ├── en_CA.tsv
│   │   ├── en_GB.tsv
│   │   └── [...]
│   └── yaml
│       ├── [...]
│       ├── en.yaml
│       ├── en_AU.yaml
│       ├── en_CA.yaml
│       ├── en_GB.yaml
│       └── [...]
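If you want to pick a file by locale from your own code, here is a minimal Python sketch based on the naming scheme above. The helper name candidate_files and the fallback from a regional locale to its base language are my assumptions for illustration. This README does not say whether the regional files are complete on their own, so treat the fallback as optional.

```python
from pathlib import Path


def candidate_files(locale, fmt, data_dir="data"):
    """Return the paths to try for a locale, most specific first.

    Hypothetical helper: "en_GB" yields the en_GB file, then the plain "en" file.
    """
    parts = locale.split("_")
    # Build "en_GB", then "en", by dropping trailing subtags one at a time.
    names = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    return [Path(data_dir) / fmt / f"{name}.{fmt}" for name in names]


paths = candidate_files("de_CH", "yaml")
print([p.as_posix() for p in paths])  # ['data/yaml/de_CH.yaml', 'data/yaml/de.yaml']
```

Take the first path in the list that actually exists in your clone.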
As for the individual formats, they all contain the same data in as near the same format as possible:
JSON:
{
  "🤓": [
    "face",
    "geek",
    "nerd",
    "nerd face"
  ]
}
TSV:
🤓 face, geek, nerd, nerd face
YAML:
🤓:
- face
- geek
- nerd
- nerd face
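Since every format maps one emoji to a list of keywords, a common first step for a picker or search box is to invert that mapping. The following is a rough Python sketch, not how Splatmoji does it: build_keyword_index and load_emoji_file are hypothetical names, and the inline sample simply mirrors the JSON shape shown above.

```python
import json
from collections import defaultdict

# Inline sample in the same shape as the JSON files; real data would be
# read from a file in the data/json/ directory of a clone.
SAMPLE = '{"🤓": ["face", "geek", "nerd", "nerd face"]}'


def load_emoji_file(path):
    """Read one JSON data file into a dict of {emoji: [keywords]}."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def build_keyword_index(emoji_to_keywords):
    """Invert {emoji: [keywords]} into {keyword: [emoji, ...]}."""
    index = defaultdict(list)
    for emoji, keywords in emoji_to_keywords.items():
        for keyword in keywords:
            # Lowercasing is an assumption so lookups can ignore case.
            index[keyword.lower()].append(emoji)
    return dict(index)


index = build_keyword_index(json.loads(SAMPLE))
print(index["nerd"])  # ['🤓']
```

For real use, replace json.loads(SAMPLE) with load_emoji_file(...) pointed at whichever language file you need.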
There is some delay after a Unicode release before the annotations and translations are released in a new version of CLDR. During this period you should see a branch like release-nn-beta for testing the upcoming release.
Expect this branch to make it to master soon after an official CLDR release.
The major version of this package is always based on the Unicode CLDR version from which it is sourced. The minor and patch versions should follow Semver 2.0.0 conventions.
The repo should be up to date with the latest CLDR from Unicode, but if for some reason you need to regenerate the data...
For the Python script that does the conversion:
- Python 3
For the Bash script that automates the whole process:
- bash
- wget
- unzip
If you use virtual environments:
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
Or if you don't use virtual environments but should:
pip3 install -r requirements.txt
# With no arguments, this script will fetch and convert directly from the latest CLDR zip and other files:
bin/convert_all_cldr
# If you have a particular version on hand:
bin/convert_all_cldr -c /path/to/cldr/core.zip -s /path/to/emoji-variation-sequences.txt -o /path/to/emoji-ordering-rules.txt
The Unicode Consortium has kind of already done the contributing by making the CLDR data available, but I'm open to any improvements or suggestions. 🙂
This repository was created and is maintained by Christopher Peterson for use in Splatmoji.
Also, if you're here you're probably a nerd of some variety and should definitely also check out the awesome podcasts Decipher SciFi and Decipher History! 🤓
CLDR data files included in this repository in the lib/ directory are distributed under the Unicode Data License.
The rest of the code and data in this repository are distributed under the Apache license.