This repository has been archived by the owner on Mar 31, 2022. It is now read-only.

amirouche/guile-babelia

guile-babelia

Wannabe search engine with federation support

babel tower beamed by an alien spaceship

(artwork by mildtravis)

Dependencies

  • guile 2.9.4
  • guile-fibers 1.0.0
  • guile-bytestructures 1.0.6
  • guile-gcrypt 0.2.0
  • guile-gnutls 3.6.9
  • guile-arew
  • wiredtiger 3.2.0-0
  • stemmer 0.0.0

See my guix channel.

v0.1.0

  • wiredtiger bindings
  • srfi-128 (comparators), not required since (mapping hash) was replaced with fash
  • srfi-146 (mappings hash), use fash
  • srfi-158 (generators)
  • web server
  • theme
  • api stub
  • pool of workers to execute blocking operations
  • snowball stemmer bindings
  • html2text
  • okvs abstractions
    • okvs (srfi-167)
    • pack and unpack
    • <engine> type class object
    • wiredtiger backend
    • nstore (srfi-168)
    • ulid
    • make thread-index a global
    • move okvs abstractions inside okvs directory (fts, counter, nstore, ulid...)
    • ulid store, rename object.scm to okvs/ustore.scm
    • add tests to ulid.scm
    • clean up: use with-directory from babelia/testing.scm
    • mapping
    • pack: support nested list
    • multimap
    • counter, requires mapping and thread-index
    • crawl scheme world
    • full-text search
      • index
        • replace anything that is not alphanumeric with a space, and filter out words whose length is strictly smaller than 2 or strictly greater than 64,
        • store each stem once in the index,
        • every known stem is associated with a count and a sum, so that tf-idf can be computed,
        • every known word is associated with a count and a sum, so that tf-idf can be computed,
        • every stem is associated with the ulid.
      • query
        • parse query: KEY WORD -MINUS,
        • validate that query is not only negation,
        • seed with most discriminant stem,
        • in parallel, compute score against bag of words,
        • keep top 30 results (configurable)
  • add babelia index PATH command to index html files
  • add babelia search KEY WORD -MINUS to search them
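
The tokenization and query-parsing rules listed above can be sketched as follows. This is an illustrative Python sketch, not the project's Scheme code: the names `tokenize` and `parse_query` are hypothetical, and lowercasing and the ASCII-only notion of "alphanumeric" are assumptions that the list above does not state. Stemming is left out.

```python
import re

MIN_LEN = 2   # words strictly shorter than this are filtered out
MAX_LEN = 64  # words strictly longer than this are filtered out


def tokenize(text):
    """Replace non-alphanumeric characters with spaces, then filter by length."""
    # Assumption: lowercase first, and treat only ASCII letters and digits
    # as alphanumeric.
    cleaned = re.sub(r"[^0-9a-z]", " ", text.lower())
    return [word for word in cleaned.split() if MIN_LEN <= len(word) <= MAX_LEN]


def parse_query(query):
    """Split a KEY WORD -MINUS query into (positive, negative) word lists."""
    positive, negative = [], []
    for term in query.split():
        if term.startswith("-"):
            negative.extend(tokenize(term[1:]))
        else:
            positive.extend(tokenize(term))
    # A query made only of negations has nothing to seed the search with.
    if not positive:
        raise ValueError("query must contain at least one positive keyword")
    return positive, negative
```

For example, `parse_query("scheme guile -python")` yields the keywords `scheme` and `guile` with `python` excluded, while `parse_query("-python")` is rejected.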

v0.2.0

  • logging library with colored output
  • okvs/fts: consider all keywords
  • okvs/wiredtiger: move the lock to the record
  • parse query into a closure
  • babelia words counter: sorted by count
    • counter-fold
  • babelia stem counter: sorted by count
  • babelia stem stop update FILENAME: input text file with stop words that must be ignored as seed candidates.
    • mapping-clear via okvs-range-remove
  • babelia stem stop guess DIRECTORY SECONDS: using a fresh database connection, benchmark each stem from most frequent to least frequent until a query takes less than SECONDS. Output the stems that are slow.
  • reject queries whose seed is a stop word
  • babelia web api secret generate --force: create file with the hex string of the secret.
  • okvs abstraction: record store (rstore)
  • web: guard all exceptions and return 500,
  • okvs fts fts-index:
    • input: html (with possibly microformats)
    • output: three values: uid, title, and preview
    • can raise babelia/index error with a reason.
    • title: min 3, max 100 truncated
    • text: min 280 chars, max ???
    • create small preview: max 280 chars
  • babelia web /api/index
  • babelia web /api/search
  • crawler:
    • make-robots.txt user-agent string
    • robots.txt-delay robots.txt path => #f or seconds,
    • robots.txt-allow? robots.txt path => #f or #t,
    • use nstore in separate directory
    • babelia crawler run: same command but in another process.
    • fiber main thread + workers
    • babelia crawler add REMOTE URL:
      • if the URL has a path, and it is html and utf8 and not a redirection, index only the given URL
      • otherwise, it is a domain:
        • check that it is not a redirection,
        • check that it is html and utf8,
        • add linked pages to todo,
        • index the given page,
        • extract links and add to todo with domain,
    • keep track of what is done and what is todo,
    • add to the todo only if it is html and utf8,
  • web: input query
  • web: display results
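
The robots.txt checks in the crawler items above can be approximated with Python's standard `urllib.robotparser` module. This is only a sketch of the idea, not the project's Scheme implementation: `make_robots`, `robots_allow` and `robots_delay` are hypothetical names loosely mirroring `make-robots.txt`, `robots.txt-allow?` and `robots.txt-delay`, and the user-agent string and sample file are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "babelia-example"  # hypothetical user-agent string

# Invented robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def make_robots(text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def robots_allow(parser, path):
    """Return True when USER_AGENT may fetch the given path."""
    return parser.can_fetch(USER_AGENT, path)


def robots_delay(parser):
    """Return the crawl delay in seconds, or None when none is declared."""
    return parser.crawl_delay(USER_AGENT)
```

Unlike the `robots.txt-delay` item above, which takes a path, this sketch looks up the delay per user agent only, because that is what the standard-library parser exposes.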

v0.3.0

  • move to R7RS https://git.sr.ht/~amz3/guile-arew
    • scheme bitwise
    • scheme bytevector
    • scheme comparator
    • scheme generator
    • scheme hash-table
    • scheme list
    • scheme mapping
    • scheme mapping hash
    • scheme set
    • guix: guile-build-system
  • normalize query: remove useless whitespace to play nice with the cache
  • log queries that take more than 5 seconds (configurable),
  • babelia queries show: output slow queries,
  • babelia cache update FILENAME
  • babelia cache refresh
  • babelia web api secret generate: encrypt the secret
  • babelia web api secret show
  • index: support structured documents
  • guix package definition for dependencies,
  • benchmark with scheme world dump, and commit the resulting,
  • need to split the number of cores between wiredtiger and the app.
  • make thread-pool size configurable,
  • okvs fts: maybe-index and reindex (delete + add)
  • federation
  • search pad
  • babelia api secret show: add it
  • babelia api secret generate: add it
  • okvs/fts:
    • OR support
    • proximity bonus
    • keyword weight
    • one way synonyms
    • two way synonyms
    • phrase matching
    • tf-idf
  • babelia crawler sitemap support
  • babelia crawler wikimedia: use rest api, otherwise fallback to wiki.
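
The query normalization item above (removing useless whitespace so that equivalent queries share a cache entry) can be illustrated with a small Python sketch. The names `normalize_query` and `cached_search`, and the dictionary used as a cache, are assumptions made for the example; they are not the project's API.

```python
def normalize_query(query):
    """Collapse runs of whitespace and strip both ends of the query."""
    return " ".join(query.split())


def cached_search(cache, query, search):
    """Look up the normalized query in the cache, computing it on a miss."""
    key = normalize_query(query)
    if key not in cache:
        cache[key] = search(key)
    return cache[key]
```

With this, `"  scheme   guile "` and `"scheme guile"` map to the same cache key, so the underlying search runs only once for both spellings.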

TODO

  • okvs nstore: improve prefix handling.
  • spell checking
  • sensimark
  • okvs pack: optimize algorithm of nested list with a single pass
  • okvs pack: pass arguments as a list instead of rest
  • babelia index: warc file input
  • babelia crawler: output warc file
  • check.scm: make it possible to execute tests from low level to high level (or high level to low level)
  • more validation, use R7RS raise and guard (to make the transaction fail),
  • entity recognition,
  • inbound links,
  • domain or page outbound links,
  • page rank.
  • gumbo bindings https://github.com/google/gumbo-parser
