This repo contains my solution to building a concurrent scraper which generates a top-k list of word frequencies.
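The core idea, counting word occurrences and keeping the k most frequent, can be sketched as follows. This is a minimal illustration, not the repo's actual implementation; the function name `topK` and the tie-breaking rule (alphabetical) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// topK returns the k most frequent words in text.
// Ties are broken alphabetically (an assumption for this sketch).
func topK(text string, k int) []string {
	counts := map[string]int{}
	for _, w := range strings.Fields(strings.ToLower(text)) {
		counts[w]++
	}
	words := make([]string, 0, len(counts))
	for w := range counts {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		if counts[words[i]] != counts[words[j]] {
			return counts[words[i]] > counts[words[j]]
		}
		return words[i] < words[j]
	})
	if k > len(words) {
		k = len(words)
	}
	return words[:k]
}

func main() {
	fmt.Println(topK("the quick the lazy the dog quick", 2))
}
```

In the real scraper the counts would be aggregated across concurrently fetched pages before the top-k list is produced.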
Execute the following from the root directory of the repo.
go run cmd/scraper/main.go
You can also view help on commands and flags (currently only the root command exists):
go run cmd/scraper/main.go --help
Note: when running in a terminal, the above will print stderr to the terminal as well. To suppress stderr entirely, run
go run cmd/scraper/main.go 2>/dev/null
To specify a configuration file, overriding the default, run
go run cmd/scraper/main.go --config ./cmd/scraper/scrapeconfig.yaml
See the provided cmd/scraper/config.yaml for an example.
Note: the .job_loader.page_cutoff parameter is set to 3 to avoid getting accidentally blocked during development. For a real-world use case, either omit the attribute or set it to 0.
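A config fragment illustrating the cutoff might look like the following. Only the `.job_loader.page_cutoff` key is documented above; everything else here is a hypothetical sketch, so consult the provided config file for the actual schema.

```yaml
# Hypothetical sketch: only job_loader.page_cutoff is documented in this README.
job_loader:
  # Limit pages fetched per job while developing; omit or set to 0 for no limit.
  page_cutoff: 3
```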
To set the log level, add a GO_LOG=<level> environment variable when executing, e.g.
GO_LOG=debug go run cmd/scraper/main.go
When log level is unspecified, the default is "info".
To run the tests, execute the following from the root directory of the repo.
go test ./...
- Support multi-node architecture for large-scale scraping
- Bloom filter to support larger dictionaries
- Benchmarks
- Test core scraper flow control