This repo contains my solution to building a concurrent scraper which generates a top-k list of word frequencies.
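The core idea, counting word occurrences and keeping the k most frequent, can be sketched as follows. This is a minimal illustration, not the repo's actual implementation; the function name `topK` and the tie-breaking rule (alphabetical) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// topK returns the k most frequent words in text.
// Ties are broken alphabetically (an assumption for this sketch).
func topK(text string, k int) []string {
	counts := map[string]int{}
	for _, w := range strings.Fields(strings.ToLower(text)) {
		counts[w]++
	}
	words := make([]string, 0, len(counts))
	for w := range counts {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		if counts[words[i]] != counts[words[j]] {
			return counts[words[i]] > counts[words[j]]
		}
		return words[i] < words[j]
	})
	if k > len(words) {
		k = len(words)
	}
	return words[:k]
}

func main() {
	fmt.Println(topK("the quick the lazy the dog quick", 2))
}
```

In the real scraper the counts would be aggregated across concurrently fetched pages before the top-k list is produced.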
Execute the following from the root directory of the repo.
go run cmd/scraper/main.go
You can also view help on commands and flags (currently only the root command exists):
go run cmd/scraper/main.go --help
Note: when running in a terminal, the above will print stderr to the terminal as well. To suppress stderr entirely, run
go run cmd/scraper/main.go 2>/dev/null
To specify a configuration file, overriding the default, run
go run cmd/scraper/main.go --config ./cmd/scraper/scrapeconfig.yaml
See the provided cmd/scraper/config.yaml for an example.
Note: the .job_loader.page_cutoff parameter is set to 3 to avoid getting accidentally blocked during development. For a real-world use case, either omit the attribute or set it to 0.
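A config fragment illustrating the cutoff might look like the following. Only the `.job_loader.page_cutoff` key is documented above; everything else here is a hypothetical sketch, so consult the provided config file for the actual schema.

```yaml
# Hypothetical sketch: only job_loader.page_cutoff is documented in this README.
job_loader:
  # Limit pages fetched per job while developing; omit or set to 0 for no limit.
  page_cutoff: 3
```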
To set the log level, add a GO_LOG=<level> environment variable when executing, e.g.
GO_LOG=debug go run cmd/scraper/main.go
When log level is unspecified, the default is "info".
To run the tests, execute the following from the root directory of the repo.
go test ./...
- Support multi-node architecture for large-scale scraping
- Bloom filter to support larger dictionaries
- Benchmarks
- Test core scraper flow control