GitHub - Tim-Saijun/gpt-web-crawler: A web crawler for GPTs to build knowledge bases

Introduction

GPT-Web-Crawler is a web crawler based on Python and Puppeteer. It crawls web pages and extracts their content: title, URL, keywords, description, all text content, all images, and a screenshot. It takes only a few lines of code to use, which makes it well suited to people who are not familiar with web crawling but want to extract content from web pages.

The spider can write its output to a JSON file, which can easily be converted to a CSV file, imported into a database, or used to build an AI agent.
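As a minimal sketch of the CSV conversion mentioned above, the following uses only the Python standard library. It assumes the output file is a JSON array of flat objects (one per crawled page), which the source does not state explicitly; the function name json_to_csv and the file paths are illustrative only.

```python
import csv
import json
from pathlib import Path


def json_to_csv(json_path, csv_path):
    """Convert a JSON array of page records into a CSV file.

    Assumption: the JSON file holds a list of flat dictionaries.
    Returns the number of records written.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))

    # Collect every key that appears in any record, in first-seen order,
    # so records with extra fields (e.g. screenshot_path) still fit.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        # Keys missing from a record are written as empty cells.
        writer.writerows(records)
    return len(records)


# Example call (hypothetical file names):
# json_to_csv("test_pakages.json", "test_pakages.csv")
```

Collecting the header from all records, rather than only the first, keeps the conversion from failing when some pages carry fields that others lack.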

Getting Started

Step 1. Install the package.

pip install gpt-web-crawler

Step 2. Copy config_template.py and rename it to config.py. If you want to use ProSpider to extract content with AI, edit config.py to set the OpenAI API key and other settings. If you don't need AI-assisted extraction, you can leave config.py unchanged.

Step 3. Run the following code to start a spider.

from gpt_web_crawler import run_spider, NoobSpider

# Start a NoobSpider at the given URL and write the results to a JSON file.
run_spider(NoobSpider,
           max_page_count=10,
           start_urls="https://www.jiecang.cn/",
           output_file="test_pakages.json",
           extract_rules=r'.*\.html')
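
The extract_rules argument above is a regular expression. The sketch below shows which URLs that pattern would select, assuming the rule is applied as a regular-expression search against each discovered URL; the source does not describe the matching semantics, so treat this as an illustration of the pattern itself rather than of the crawler's internals. The helper matches_rule and the example.com URLs are hypothetical.

```python
import re

# Same pattern as passed to run_spider above.
EXTRACT_RULE = r'.*\.html'


def matches_rule(url, rule=EXTRACT_RULE):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(rule, url) is not None


# Hypothetical URLs used only to illustrate the pattern.
candidates = [
    "https://example.com/products/desk.html",
    "https://example.com/news/2023/item.html?page=2",
    "https://example.com/contact",
    "https://example.com/images/logo.png",
]

for url in candidates:
    print(url, matches_rule(url))
```

Because the pattern is not anchored at the end, a URL with a query string after ".html" is also selected; appending "$" to the pattern would restrict it to URLs that end in ".html".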

Spiders

The code above uses NoobSpider. The package provides four spiders: NoobSpider, CatSpider, ProSpider, and LionSpider. They differ in the content they extract from a web page, as the following table shows.

Spider Type	Description	Return Content
NoobSpider	Basic web page scraping	- title - url - keywords - description - body: all text content of the web page
CatSpider	Web page scraping with screenshots	- title - url - keywords - description - body: all text content of the web page - screenshot_path
ProSpider	Web page scraping with AI-extracted content	- title - url - keywords - description - body: all text content of the web page - ai_extract_content: GPT's extraction of the body text
LionSpider	Web page scraping with all images extracted	- title - url - keywords - description - body: all text content of the web page - directory: the directory containing all images from the web page
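
The table can be restated as a small lookup that checks whether a record from the output file carries the fields expected for a given spider. This is a sketch, not part of the package: it assumes the JSON keys match the labels in the table (with the first one spelled title, as in the introduction), and the names BASE_FIELDS, EXTRA_FIELDS, expected_fields, and missing_fields are hypothetical.

```python
# Fields every spider returns, according to the table.
BASE_FIELDS = ("title", "url", "keywords", "description", "body")

# Additional field each spider adds on top of the base set.
EXTRA_FIELDS = {
    "NoobSpider": (),
    "CatSpider": ("screenshot_path",),
    "ProSpider": ("ai_extract_content",),
    "LionSpider": ("directory",),
}


def expected_fields(spider_name):
    """Return the tuple of field names the table lists for a spider."""
    return BASE_FIELDS + EXTRA_FIELDS[spider_name]


def missing_fields(record, spider_name):
    """Return the expected fields that are absent from a record (a dict)."""
    return [name for name in expected_fields(spider_name) if name not in record]


# Illustrative record with made-up values.
record = {
    "title": "Example",
    "url": "https://example.com/index.html",
    "keywords": "",
    "description": "",
    "body": "Example body text",
}
print(missing_fields(record, "NoobSpider"))
print(missing_fields(record, "CatSpider"))
```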

Cat Spider

The Cat spider can take screenshots of web pages. It is based on the Noob spider and uses Puppeteer to simulate browser operations, capture a screenshot of the entire web page, and save it as an image. To use the Cat spider, you therefore need to install Puppeteer first. You can refer to this answer to install npm, and then use the following command to install Puppeteer:

npm install puppeteer
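
Before running the Cat spider, it may help to confirm that the Node.js tooling is reachable. The following is a minimal sketch using only the Python standard library; it checks that the node and npm executables are on the PATH, and does not verify that Puppeteer itself is installed or where the crawler looks for it. The function name missing_tools is hypothetical.

```python
import shutil


def missing_tools(tools=("node", "npm")):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("node and npm are available")
```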

TODO

Support running without configuring config.py
More customization options for the spiders

