Wanna be search engine with federation support
(artwork by mildtravis)
- guile 2.9.4
- guile-fibers 1.0.0
- guile-bytestructures 1.0.6
- guile-gcrypt 0.2.0
- guile-gnutls 3.6.9
- guile-arew
- wiredtiger 3.2.0-0
- stemmer 0.0.0
See my guix channel.
- wiredtiger bindings
- srfi-128 (comparators), not required since (mapping hash) was replaced with fash
- srfi-146 (mappings hash), use fash
- srfi-158 (generators)
- web server
- theme
- api stub
- pool of workers to execute blocking operations
- snowball stemmer bindings
- html2text
- okvs abstractions
- okvs (srfi-167)
- pack and unpack
- <engine> type class object
- wiredtiger backend
- nstore (srfi-168)
- ulid
- make thread-index a global
- move okvs abstractions inside okvs directory (fts, counter, nstore, ulid...)
- ulid store, rename object.scm to okvs/ustore.scm
- add tests to ulid.scm
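
As an illustration of the ulid items above, here is a minimal sketch of ULID generation following the public ULID layout (48-bit millisecond timestamp followed by 80 random bits, rendered as 26 Crockford base32 characters). It is written in Python for brevity and is not the project's Guile code; the function name `ulid` and its parameters are assumptions made for this example.

```python
import os
import time

# Crockford base32 alphabet used by the ULID layout (no I, L, O, U).
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid(timestamp_ms=None, randomness=None):
    """Return a 26-character ULID string (hypothetical helper)."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if randomness is None:
        randomness = os.urandom(10)
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes")
    # 48-bit timestamp in the high bits, 80 random bits in the low bits.
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):
        chars.append(ALPHABET[value & 31])  # emit 5 bits per character
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the most significant bits, identifiers sort lexicographically by creation time, which is what makes them convenient keys in an ordered key-value store.
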
- clean up: use with-directory from babelia/testing.scm
- mapping
- pack: support nested list
- multimap
- counter, requires mapping and thread-index
- crawl scheme world
- full-text search
- index
- replace anything that is not alphanumeric with a space, and filter out words strictly shorter than 2 or strictly longer than 64 characters,
- store each stem once in the index,
- every known stem is associated with a count, and the total sum is kept, to be able to compute tf-idf,
- every known word is associated with a count, and the total sum is kept, to be able to compute tf-idf,
- every stem is associated with the ulid.
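
The tokenization rule in the index steps above can be sketched as follows. This is a Python illustration, not the project's Guile code: the names `tokenize` and `unique_stems` are hypothetical, lowercasing is an assumption, and the `stem` argument defaults to the identity function as a stand-in for the snowball stemmer.

```python
MIN_LENGTH = 2   # words strictly shorter than this are dropped
MAX_LENGTH = 64  # words strictly longer than this are dropped


def tokenize(text):
    """Replace non-alphanumeric characters with spaces, then filter by length."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return [w for w in cleaned.split() if MIN_LENGTH <= len(w) <= MAX_LENGTH]


def unique_stems(text, stem=lambda word: word):
    """Return each stem once, as it would be stored in the index.

    `stem` is a placeholder (assumption); a real stemmer would go here.
    """
    return sorted({stem(word) for word in tokenize(text)})
```

For example, `tokenize("Hello, world! a I x42")` keeps `hello`, `world` and `x42` and drops the one-letter words.
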
- query
- parse query: KEY WORD -MINUS,
- validate that query is not only negation,
- seed with most discriminant stem,
- in parallel, compute score against bag of words
- keep top 30 results (configurable)
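
The query steps above can be sketched as below, in Python rather than the project's Guile. Several points are assumptions made for illustration: "most discriminant" is read as "lowest count in the index", the score is a plain sum of keyword occurrences, words are used instead of stems, scoring runs sequentially rather than in parallel, and all function names are hypothetical.

```python
import heapq
from collections import Counter


def parse_query(query):
    """Split 'KEY WORD -MINUS' into positive and negative keywords."""
    positive, negative = [], []
    for token in query.lower().split():
        if token.startswith("-"):
            if token[1:]:
                negative.append(token[1:])
        else:
            positive.append(token)
    if not positive:
        # A query made only of negations cannot be seeded.
        raise ValueError("query must contain at least one positive keyword")
    return positive, negative


def pick_seed(positive, counts):
    """Pick the rarest positive keyword (assumed meaning of 'discriminant')."""
    return min(positive, key=lambda word: counts.get(word, 0))


def search(documents, counts, query, limit=30):
    """Return up to `limit` (score, uid) pairs, best first."""
    positive, negative = parse_query(query)
    seed = pick_seed(positive, counts)
    hits = []
    for uid, text in documents.items():
        bag = Counter(text.lower().split())
        if bag[seed] == 0 or any(bag[word] for word in negative):
            continue
        hits.append((sum(bag[word] for word in positive), uid))
    return heapq.nlargest(limit, hits)
```

Seeding with the rarest keyword keeps the candidate set small before the more expensive scoring step runs.
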
- index
- add babelia index PATH command to index html files
- add babelia search KEY WORD -MINUS to search them
- logging library with colored output
- okvs/fts: consider all keywords
- okvs/wiredtiger: move the lock to the record
- parse query into a closure
- babelia words counter: sorted by count
- counter-fold
- babelia stem counter: sorted by count
- babelia stem stop update FILENAME: input text file with stop words that must be ignored as seed candidates.
- mapping-clear via okvs-range-remove
- babelia stem stop guess DIRECTORY SECONDS: using a fresh database connection, benchmark each stem from most frequent to least frequent until a query takes less than SECONDS. Output the stems that are slow.
- reject queries whose seed is a stop word
- babelia web api secret generate --force: create file with the hex string of the secret.
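
A minimal sketch of what a secret generate command with a --force flag might do, assuming that without --force an existing file is left untouched. It is Python for illustration only; the function name, the 32-byte default and the file permissions are assumptions, not the project's actual behavior.

```python
import os
import secrets


def generate_secret(path, force=False, nbytes=32):
    """Write a random hex secret to `path` and return it (hypothetical helper)."""
    if os.path.exists(path) and not force:
        raise FileExistsError(f"{path} already exists, pass force=True to overwrite")
    secret = secrets.token_hex(nbytes)  # 2 * nbytes hexadecimal characters
    with open(path, "w", encoding="ascii") as handle:
        handle.write(secret)
    os.chmod(path, 0o600)  # assumption: readable by the owner only
    return secret
```
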
- okvs abstraction: record store (rstore)
- web: guard all exceptions and return 500,
- okvs fts fts-index:
- input: html (with possibly microformats)
- output: three values: uid, title, and preview
- can raise babelia/index error with a reason.
- title: min 3, max 100 truncated
- text: min 280 chars, max ???
- create small preview: max 280 chars
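
The title and preview limits above can be sketched as follows, in Python for illustration. `BabeliaIndexError` stands in for the "babelia/index error with a reason" and is a hypothetical name, as is `summarize`. Whitespace collapsing is an assumption, and the uid is left out of the sketch.

```python
TITLE_MIN, TITLE_MAX = 3, 100
TEXT_MIN = 280
PREVIEW_MAX = 280


class BabeliaIndexError(Exception):
    """Raised with a reason when a document cannot be indexed (hypothetical)."""


def summarize(title, text):
    """Validate a document and return its (title, preview) pair."""
    title = " ".join(title.split())
    text = " ".join(text.split())
    if len(title) < TITLE_MIN:
        raise BabeliaIndexError("title too short")
    if len(text) < TEXT_MIN:
        raise BabeliaIndexError("text too short")
    # Long titles are truncated; the preview is the head of the text.
    return title[:TITLE_MAX], text[:PREVIEW_MAX]
```
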
- babelia web /api/index
- babelia web /api/search
- crawler:
- make-robots.txt user-agent string
- robots.txt-delay robots.txt path => #f or seconds,
- robots.txt-allow? robots.txt path => #f or #t,
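
For comparison, the three robots.txt procedures above map closely onto Python's standard `urllib.robotparser` module. The wrapper below is a sketch: the class name `Robots` and its methods are hypothetical, and binding the user agent at construction time is an assumption about how `make-robots.txt` is meant to work.

```python
from urllib.robotparser import RobotFileParser


class Robots:
    """Parsed robots.txt bound to one user agent (hypothetical wrapper)."""

    def __init__(self, user_agent, text):
        self.user_agent = user_agent
        self._parser = RobotFileParser()
        self._parser.parse(text.splitlines())

    def delay(self):
        """Return the crawl delay in seconds, or None when unspecified."""
        return self._parser.crawl_delay(self.user_agent)

    def allow(self, path):
        """Return True when the user agent may fetch `path`."""
        return self._parser.can_fetch(self.user_agent, path)
```

Here `None` plays the role of `#f` in the delay case.
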
- use nstore in separate directory
- babelia crawler run: same command but another process.
- fiber main thread + workers
- babelia crawler add REMOTE URL:
- if it has a path, and it is html and utf8 and not a redirection, index only the given URL
- otherwise, it is a domain:
- check that it is not a redirection,
- check that it is html and utf8,
- add linked pages to todo,
- index the given page,
- extract links and add to todo with domain,
- keep track of what is done and what is todo,
- add to the todo only if it is html and utf8,
- web: input query
- web: display results
- move to R7RS https://git.sr.ht/~amz3/guile-arew
- scheme bitwise
- scheme bytevector
- scheme comparator
- scheme generator
- scheme hash-table
- scheme list
- scheme mapping
- scheme mapping hash
- scheme set
- guix: guile-build-system
- normalize query: remove useless whitespace to play nice with the cache
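
A small Python sketch of why whitespace normalization helps the cache: queries that differ only in spacing collapse to the same key. The names `normalize`, `cached_search` and the in-memory dictionary are assumptions made for illustration, not the project's cache.

```python
def normalize(query):
    """Collapse runs of whitespace and strip both ends."""
    return " ".join(query.split())


_cache = {}  # hypothetical in-memory cache keyed by normalized query


def cached_search(query, search):
    """Call `search` at most once per normalized query."""
    key = normalize(query)
    if key not in _cache:
        _cache[key] = search(key)
    return _cache[key]
```
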
- log queries that take more than 5 seconds (configurable),
- babelia queries show: output slow queries,
- babelia cache update FILENAME
- babelia cache refresh
- babelia web api secret generate: encrypt the secret
- babelia web api secret show
- index: support structured documents
- guix package definition for dependencies,
- benchmark with scheme world dump, and commit the resulting,
- need to split the number of cores between wiredtiger and the app.
- Make thread-pool size configurable,
- okvs fts: maybe-index and reindex (delete + add)
- federation
- search pad
- babelia api secret show: add it
- babelia api secret generate: add it
- okvs/fts:
- OR support
- proximity bonus
- keyword weight
- one way synonyms
- two way synonyms
- phrase matching
- tf-idf
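
For reference, here is a minimal sketch of the textbook tf-idf weighting: term frequency within a document multiplied by the logarithm of the inverse document frequency. It is Python for illustration; the function name and the list-of-tokens representation are assumptions, and the variant the project ends up with may differ.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Score `term` in `document` (a list of tokens) against `corpus`."""
    if not document:
        return 0.0
    tf = Counter(document)[term] / len(document)
    df = sum(1 for other in corpus if term in other)
    # A term that appears in no document gets a zero score.
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

A term that appears in every document gets an idf of zero, so it contributes nothing to the ranking.
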
- babelia crawler sitemap support
- babelia crawler wikimedia: use rest api, otherwise fall back to wiki.
- okvs nstore: improve prefix handling.
- spell checking
- sensimark
- okvs pack: optimize algorithm of nested list with a single pass
- okvs pack: pass arguments as a list instead of rest arguments
- babelia index: warc file input
- babelia crawler: output warc file
- check.scm: make it possible to execute tests from low level to high level (or high level to low level)
- more validation, use R7RS raise and guard (to make the transaction fail),
- entity recognition,
- inbound links,
- domain or page outbound links,
- page rank.
- gumbo bindings https://github.com/google/gumbo-parser