GitHub - CMHLP/siren: Webscrapers and tools for the SIREN project

E-Papers (siren/scrapers/epaper)

Online Publications

Mirror
News Minute
Telegraph
Quint/Week/Wire/Scroll

Benchmarks

Device	CPU (model, cores+threads, clock)	RAM (size, speed)	GPU (model, vram, clock)
PC	Ryzen 7 3700x, 8+16, 4.2GHz	32GB DDR4 @ 3000MHz	RX 5700 XT, 8GB, 1750MHz
Laptop	Ryzen 3 3250U, 2+4, 2.6GHz	6GB DDR4 @ 2400 MHz	none (integrated Vega 3)
Raspi 5	ARM Cortex A-76, 4+0, 2.4GHz	8GB DDR4 @ 4267 MHz	VideoCore VII, -, 800MHz
Github Actions*	AMD EPYC 7763, 4, 2.2GHz	16GB DDR4 @ unknown	none

Device	Pytesseract (Single-threaded)	EasyOCR (Single-threaded)	Pytesseract (Multithreaded)	EasyOCR (Multithreaded)
PC	143.14	804.62	41	564.27
Laptop	933.23	1749.2	788.3	1592.45
Raspi 5	619.41	1508.53	461.13	1228
Github Actions	300	1486.82	Process Timeout	1180.09

Interestingly, using too many threads can result in worse performance than a single-threaded run. This is likely because of the increased context-switching and resource sharing across the threads. Python's default ThreadPoolExecutor has a default of (min(32, os.cpu_count() + 4) max threads. On the PC, this would be 16 threads + 4 = 20 max threads. However, this is not preferable, as the OCR engines are executed individually on each thread; which is a massive waste of resources. From testing, (os.cpu_count() + 4) // 4 has achieved better results across devices.

Additionally, none of my devices have a CUDA GPUs; this would significantly improve performance for EasyOCR, which has optional GPU support.

Github Actions also has a 4-core limit as per their documentation. However, while EasyOCR gets a performance benefit from multithreading, pytesseract always seems to cause the process to crash.

There are around 70 seperate editions covered by the scraper suite. Not all of these use OCR, so we should be able to comfortably use Actions for automation.

Name		Name	Last commit message	Last commit date
Latest commit History 117 Commits
.github/workflows		.github/workflows
siren		siren
.gitignore		.gitignore
README.md		README.md
config.toml		config.toml
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
template.yml		template.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Benchmarks

About

Contributors 2

Languages

CMHLP/siren

Folders and files

Latest commit

History

Repository files navigation

Benchmarks

About

Resources

Stars

Watchers

Forks

Contributors 2

Languages