The Serene Benchmark project provides a common framework to evaluate different approaches for schema matching and mapping. Currently, the framework supports evaluation of three approaches for semantic labeling of relational data sources.
To run the evaluation benchmark:
- The server for Karma DSL needs to be installed and started.
- The server for Serene needs to be started.
- The tensorflow and keras Python packages need to be installed for the neural network models.
- The Serene Python client needs to be installed.
Decompress sources and labels in the data folder.
To install the package 'serene-benchmark', run
python setup.py install
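After installation, a quick way to check that the Python dependencies are available is to try importing them. This is only an illustrative sketch; the module names `serene` (for the Serene Python client) and `serene_benchmark` are assumptions and may differ depending on how the packages were installed.

```python
# Illustrative sanity check: verify that the Python packages used by the
# benchmark can be imported. The module names "serene" and "serene_benchmark"
# are assumptions.
import importlib

for name in ["tensorflow", "keras", "serene", "serene_benchmark"]:
    try:
        importlib.import_module(name)
        print("{}: ok".format(name))
    except ImportError as err:
        print("{}: missing ({})".format(name, err))
```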
There are three different approaches for semantic typing which can currently be evaluated in this project:
- DSL: domain-independent semantic labeller
- DINT: relational data integrator
- NNet: deep neural networks (MLP and CNN); additionally, there is a random forest implementation using scikit-learn.
For NNetModel, the allowed model types are 'cnn@charseq' (a CNN on character sequences) and 'mlp@charfreq' (an MLP on character frequencies plus entropy). There is also 'rf@charfreq', which uses the scikit-learn implementation of random forests, whereas DINT uses Spark MLlib.
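As a rough illustration, NNetModel instances could be created along the following lines. The model-type strings come from the list above, but the import path and the `description` keyword argument are assumptions, so check benchmark.py for the authoritative usage.

```python
# Hedged sketch: only the model-type strings are taken from the text above;
# the import path and "description" keyword are assumptions.
from serene_benchmark import NNetModel

cnn_model = NNetModel(["cnn@charseq"], description="CNN on character sequences")
mlp_model = NNetModel(["mlp@charfreq"], description="MLP on character frequencies + entropy")
rf_model = NNetModel(["rf@charfreq"], description="scikit-learn random forest on character frequencies")
```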
The DINT feature configuration is explained here, and the resampling strategy here.
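For orientation only, a DINT model configuration might look roughly like the sketch below. The dictionary keys, the feature names, and the resampling strategy string are illustrative assumptions, so refer to the documentation linked above for the actual options.

```python
# Hedged sketch of a DINT configuration. Keys, feature names and the
# resampling strategy value are assumptions used only for illustration.
from serene_benchmark import DINTModel

feature_config = {
    "activeFeatures": ["num-unique-vals", "prop-unique-vals", "prop-missing-vals"],
    "activeFeatureGroups": ["stats-of-text-length", "char-dist-features"],
    "featureExtractorParams": [
        {"name": "prop-instances-per-class-in-knearestneighbours", "num-neighbours": 5}
    ],
}

dint_model = DINTModel(feature_config=feature_config,
                       resampling_strategy="ResampleToMean",
                       description="DINT with character-distribution features")
```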
DSL can be run in two modes: "normal", where labeled data is used only from one domain, or "enhanced", where labeled data from other domains is also used. The approach is explained in the paper.
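Purely as a hypothetical illustration of the two regimes, a DSL model could be configured as sketched below; apart from the class name, every argument shown here is an assumption, so consult benchmark.py for how the mode is actually selected.

```python
# Hypothetical illustration of the two DSL regimes described above.
# Both "use_other_domains" and "description" are assumed arguments; the real
# KarmaDSLModel constructor may differ (see benchmark.py).
from serene_benchmark import KarmaDSLModel

dsl_normal = KarmaDSLModel(use_other_domains=False,
                           description="DSL: labeled data from the target domain only")
dsl_enhanced = KarmaDSLModel(use_other_domains=True,
                             description="DSL: labeled data from other domains as well")
```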
benchmark.py provides a running example of two experiments to evaluate different approaches for semantic labeling.
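At a high level, the pattern in benchmark.py is to build a list of models and hand them to an experiment runner. The sketch below is only an approximation of that pattern, reusing the model instances from the earlier sketches; the `Experiment` class name, its arguments, and the `experiment_type` value are assumptions, so treat benchmark.py itself as the reference.

```python
# Approximate pattern only; class and argument names are assumptions, and the
# model instances are the ones created in the sketches above.
from serene_benchmark import Experiment

models = [dint_model, cnn_model, dsl_normal]

experiment = Experiment(models,
                        experiment_type="leave_one_out",
                        description="leave-one-out benchmark",
                        result_csv="results.csv")
experiment.run()
```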
To add another semantic labeling approach to the benchmark, one needs to create a class which inherits from SemanticTyper and implements four methods: define_training_data, train, reset and predict. One can also modify the initialization of the class instance.
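A minimal skeleton for such a class might look like the following. The constructor arguments and the exact method signatures are assumptions based on the method names listed above, so the existing models in the package are the best reference.

```python
# Skeleton sketch of a new semantic labeling approach. Method signatures and
# the base-class constructor call are assumptions; only the four method names
# (define_training_data, train, reset, predict) come from the text above.
from serene_benchmark import SemanticTyper

class MyTyper(SemanticTyper):
    def __init__(self, description=""):
        super().__init__("MyTyper", description=description)  # assumed base signature
        self._train_sources = None
        self._train_labels = None

    def define_training_data(self, train_sources, train_labels):
        # Convert/store the labeled training sources for this approach.
        self._train_sources = train_sources
        self._train_labels = train_labels

    def train(self):
        # Fit the underlying model on the data set by define_training_data.
        pass

    def reset(self):
        # Clear any trained state so the next experiment run starts fresh.
        self._train_sources = None
        self._train_labels = None

    def predict(self, source):
        # Return predicted semantic labels for the columns of `source`.
        return []
```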
Nose needs to be installed to run the unit tests. To run the tests:
nosetests