WC parser

This repo contains my solution for building a concurrent scraper that generates a top-k list of word frequencies across all the articles it crawls.
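
As background for the core idea, here is a minimal, sequential sketch of top-k word-frequency aggregation. It is illustrative only and is not the repo's actual code, which performs this concurrently over crawled articles:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// topK returns the k most frequent words across the given documents,
// with ties broken alphabetically. This sketches only the aggregation
// step; a scraper would feed documents in as pages are crawled.
func topK(docs []string, k int) []string {
	counts := map[string]int{}
	for _, doc := range docs {
		for _, w := range strings.Fields(strings.ToLower(doc)) {
			counts[w]++
		}
	}
	words := make([]string, 0, len(counts))
	for w := range counts {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		if counts[words[i]] != counts[words[j]] {
			return counts[words[i]] > counts[words[j]] // higher count first
		}
		return words[i] < words[j] // alphabetical tie-break
	})
	if k > len(words) {
		k = len(words)
	}
	return words[:k]
}

func main() {
	fmt.Println(topK([]string{"go scraper go", "scraper counts words"}, 2))
}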

Run the CLI

Execute the following from the root directory of the repo.

go run cmd/scraper/main.go 

You can also view help for commands and flags (currently only the root command exists):

go run cmd/scraper/main.go --help

Note: when running in a terminal, the above will also print stderr to the terminal. To suppress stderr entirely, run

go run cmd/scraper/main.go 2>/dev/null

Specify configuration

To specify a configuration file, overriding the default, run

go run cmd/scraper/main.go --config ./cmd/scraper/scrapeconfig.yaml

See the provided cmd/scraper/config.yaml for an example.

Note: the .job_loader.page_cutoff parameter is set to 3 to avoid getting accidentally blocked during development. For a real-world use case, either omit the attribute or set it to 0.
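
A minimal sketch of the relevant portion of the config file. Only the .job_loader.page_cutoff key is documented above; any other keys live in the provided file and are not reproduced here:

job_loader:
  # Limits how many pages are crawled per job. 3 is a safe value for
  # development; omit the key or set it to 0 to disable the cutoff.
  page_cutoff: 3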

Specify log level

To set the log level, set the GO_LOG=<level> environment variable when executing, e.g.

GO_LOG=debug go run cmd/scraper/main.go

When log level is unspecified, the default is "info".
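
The repo's actual logging setup is not shown here; as an illustration, here is a minimal sketch of how a GO_LOG-driven level with an "info" default could be wired using the standard library's log/slog (the library and wiring used in this repo may differ):

package main

import (
	"log/slog"
	"os"
	"strings"
)

// parseLevel maps a GO_LOG value to a slog.Level, defaulting to info.
func parseLevel(s string) slog.Level {
	switch strings.ToLower(s) {
	case "debug":
		return slog.LevelDebug
	case "warn":
		return slog.LevelWarn
	case "error":
		return slog.LevelError
	default:
		return slog.LevelInfo
	}
}

func main() {
	handler := slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
		Level: parseLevel(os.Getenv("GO_LOG")),
	})
	slog.SetDefault(slog.New(handler))

	slog.Debug("only visible with GO_LOG=debug")
	slog.Info("scraper starting")
}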

Run Tests

Execute the following from the root directory of the repo.

go test ./...

Future improvements

  • Support a multi-node architecture for large-scale scraping
  • Use a Bloom filter to support larger dictionaries (see the sketch after this list)
  • Benchmarks
  • Test the core scraper's flow control
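
For the Bloom filter item, here is a minimal sketch of what such a filter could look like; it is illustrative only, and the sizes, hash choice, and names are assumptions rather than a planned design:

package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size Bloom filter using double hashing to derive
// k bit positions from a single 64-bit FNV-1a digest.
type bloom struct {
	bits []uint64
	m    uint32 // total number of bits
	k    uint32 // number of derived hash functions
}

func newBloom(m, k uint32) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *bloom) positions(s string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)|1 // force h2 odd to avoid a degenerate stride
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (h1 + i*h2) % b.m
	}
	return out
}

func (b *bloom) Add(s string) {
	for _, p := range b.positions(s) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// MayContain reports false only if s was definitely never added;
// true answers may be false positives.
func (b *bloom) MayContain(s string) bool {
	for _, p := range b.positions(s) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1<<16, 4)
	f.Add("scraper")
	fmt.Println(f.MayContain("scraper"), f.MayContain("unseen-word"))
}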
