This project implements patent classification at the subclass level of the
IPC and CPC systems, which involves more than 600 classes.
The project pipeline is as follows:
- Extract dataset
- EDA of the dataset
- Train a model
A Jupyter notebook is shared for each of the above tasks.
The dataset for the classification task is generated with Google BigQuery and stored as CSV files, one file per year from 2009 to 2019. The dataset is publicly available for experimentation. The attributes of these CSV files are shown in the table below:
ID | Date | Title | Claim | cpc_subclass |
---|---|---|---|---|
8844051 | 2014-09-23 | Lithium-ion secondary battery | A lithium-ion secondary battery comprising ... | H01M,Y02E,Y02T |
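
The sample row lists several subclasses in `cpc_subclass`, so each patent can carry more than one label. The sketch below shows one way to read such a file and split that field into a list of labels. It assumes the CSV header matches the column names in the table and that subclasses are comma-separated inside a quoted field. The helper names `split_subclasses` and `read_patents` are hypothetical and not part of the shared notebooks, and an in-memory sample stands in for a real yearly file.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for one yearly CSV file. The header is assumed to match
# the column names in the table above; the row is the sample row from it.
SAMPLE_CSV = '''ID,Date,Title,Claim,cpc_subclass
8844051,2014-09-23,Lithium-ion secondary battery,A lithium-ion secondary battery comprising ...,"H01M,Y02E,Y02T"
'''


def split_subclasses(field):
    """Split a comma-separated cpc_subclass field into a list of codes."""
    return [code.strip() for code in field.split(",") if code.strip()]


def read_patents(handle):
    """Read patent rows from an open CSV handle and attach a 'labels' list."""
    rows = []
    for row in csv.DictReader(handle):
        row["labels"] = split_subclasses(row["cpc_subclass"])
        rows.append(row)
    return rows


patents = read_patents(io.StringIO(SAMPLE_CSV))

# Count how often each subclass appears, a typical first step of the EDA.
label_counts = Counter(label for p in patents for label in p["labels"])
print(patents[0]["labels"])
print(label_counts.most_common(3))
```

To run this on a downloaded file, replace `io.StringIO(SAMPLE_CSV)` with a handle from `open(path, newline="", encoding="utf-8")`, where `path` points to one of the yearly CSV files.
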
The link to download this dataset by year is provided below.