Go crawler library
```
go get -u github.com/allentom/youcrawl
```
HTML parser: PuerkitoBio/goquery
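Since parser callbacks receive a `*goquery.Document`, extraction uses goquery's jQuery-style selector API. As a quick standalone sketch of that API (independent of youcrawl; the sample HTML is made up for illustration):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `<html><head><title>Example</title></head><body><a href="/a">A</a></body></html>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		panic(err)
	}
	// CSS selectors work just like in the parser callbacks shown below.
	fmt.Println(doc.Find("title").Text()) // prints "Example"
}
```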
In the architecture diagram, the parts highlighted in yellow are executed in parallel.
The crawler library contains the following components, each of which can be added as needed (the sketch after the list shows how each one attaches to the engine):
- Middleware
- HTML Parser
- Pipeline
- GlobalStore
- PostProcess
- Plugin
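For orientation, here is how each component registers on the engine, using only calls that appear in the examples below; a complete, working version of this setup follows later in this README:

```go
package main

import (
	"github.com/PuerkitoBio/goquery"
	"github.com/allentom/youcrawl"
)

func main() {
	e := youcrawl.NewEngine(&youcrawl.EngineOption{MaxRequest: 5})

	// Middleware: mutates each outgoing request (here: random User-Agent headers).
	e.UseMiddleware(&youcrawl.UserAgentMiddleware{})

	// HTML Parser: a callback run against every fetched document.
	e.AddHTMLParser(func(doc *goquery.Document, ctx *youcrawl.Context) error {
		return nil // extraction logic goes here
	})

	// Pipeline: per-item processing; GlobalStorePipeline collects items
	// into the shared GlobalStore.
	e.AddPipelines(&youcrawl.GlobalStorePipeline{})

	// PostProcess: runs once after the crawl, e.g. dumping the GlobalStore to JSON.
	e.AddPostProcess(&youcrawl.OutputJsonPostProcess{StorePath: "./output.json"})

	e.AddURLs("http://www.example.com")
	e.RunAndWait()
}
```

(Plugins are omitted from this sketch; see the project source for how they are registered.)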
Since most components are optional, a minimal crawler needs very little code:
```go
package main

import "github.com/allentom/youcrawl"

func main() {
	e := youcrawl.NewEngine(
		&youcrawl.EngineOption{
			// up to 5 tasks run at the same time
			MaxRequest: 5,
		},
	)
	e.AddURLs("http://www.example.com")
	e.RunAndWait()
}
```
The code above only requests the page. The next example registers several components to demonstrate more features, so the code is a little longer: it collects data from the website and saves it to a JSON file.
```go
package main

import (
	"fmt"

	"github.com/PuerkitoBio/goquery"
	"github.com/allentom/youcrawl"
)

func main() {
	e := youcrawl.NewEngine(
		&youcrawl.EngineOption{
			// up to 5 tasks run at the same time
			MaxRequest: 5,
		},
	)
	// add the URL to crawl
	e.AddURLs("http://www.example.com")
	// UserAgentMiddleware attaches a random User-Agent header to each request
	e.UseMiddleware(&youcrawl.UserAgentMiddleware{})
	// add a parser that reads the page title and stores it on the item
	e.AddHTMLParser(func(doc *goquery.Document, ctx *youcrawl.Context) error {
		title := doc.Find("title").Text()
		fmt.Println(title)
		ctx.Item.SetValue("title", title)
		return nil
	})
	// GlobalStorePipeline appends each item to the `items` field of the GlobalStore
	e.AddPipelines(&youcrawl.GlobalStorePipeline{})
	// after crawling, write the `items` field of the GlobalStore to a JSON file
	e.AddPostProcess(&youcrawl.OutputJsonPostProcess{
		StorePath: "./output.json",
	})
	e.RunAndWait()
}
```
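A parser is not limited to a single field. The sketch below is a hypothetical parser (same imports as the example above) that also collects every link on the page; `Find`, `Each`, and `Attr` are standard goquery calls, `ctx.Item.SetValue` follows the usage shown above, and the `links` field name is just an illustration:

```go
// collectLinks is a hypothetical parser that stores the page title and all
// outgoing link targets on the current item.
func collectLinks(doc *goquery.Document, ctx *youcrawl.Context) error {
	ctx.Item.SetValue("title", doc.Find("title").Text())

	links := make([]string, 0)
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			links = append(links, href)
		}
	})
	ctx.Item.SetValue("links", links)
	return nil
}
```

Register it with `e.AddHTMLParser(collectLinks)` just like the inline parser above; the collected items reach `output.json` through the same pipeline and post-process.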