Merge pull request #3542 from lonvia/remove-legacy-tokenizer

Remove legacy tokenizer
osm-search · Sep 24, 2024 · d856788 · d856788
2 parents e92e03e + f960a9b
commit d856788
Show file tree

Hide file tree

Showing 53 changed files with 59 additions and 4,339 deletions.
diff --git a/.pylintrc b/.pylintrc
@@ -13,10 +13,10 @@ ignored-classes=NominatimArgs,closing
 # 'too-many-ancestors' is triggered already by deriving from UserDict
 # 'not-context-manager' disabled because it causes false positives once
 #   typed Python is enabled. See also https://github.com/PyCQA/pylint/issues/5273
-disable=too-few-public-methods,duplicate-code,too-many-ancestors,bad-option-value,no-self-use,not-context-manager,use-dict-literal,chained-comparison,attribute-defined-outside-init,too-many-boolean-expressions,contextmanager-generator-missing-cleanup
+disable=too-few-public-methods,duplicate-code,too-many-ancestors,bad-option-value,no-self-use,not-context-manager,use-dict-literal,chained-comparison,attribute-defined-outside-init,too-many-boolean-expressions,contextmanager-generator-missing-cleanup,too-many-positional-arguments
 
 good-names=i,j,x,y,m,t,fd,db,cc,x1,x2,y1,y2,pt,k,v,nr
 
 [DESIGN]
 
-max-returns=7
+max-returns=7
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -44,7 +44,6 @@ endif()
 
 set(BUILD_IMPORTER on CACHE BOOL "Build everything for importing/updating the database")
 set(BUILD_API on CACHE BOOL "Build everything for the API server")
-set(BUILD_MODULE off CACHE BOOL "Build PostgreSQL module for legacy tokenizer")
 set(BUILD_TESTS on CACHE BOOL "Build test suite")
 set(BUILD_OSM2PGSQL on CACHE BOOL "Build osm2pgsql (expert only)")
 set(INSTALL_MUNIN_PLUGINS on CACHE BOOL "Install Munin plugins for supervising Nominatim")
@@ -139,14 +138,6 @@ if (BUILD_TESTS)
     endif()
 endif()
 
-#-----------------------------------------------------------------------------
-# Postgres module
-#-----------------------------------------------------------------------------
-
-if (BUILD_MODULE)
-    add_subdirectory(module)
-endif()
-
 #-----------------------------------------------------------------------------
 # Installation
 #-----------------------------------------------------------------------------
@@ -195,11 +186,6 @@ if (BUILD_OSM2PGSQL)
     endif()
 endif()
 
-if (BUILD_MODULE)
-    install(PROGRAMS ${PROJECT_BINARY_DIR}/module/nominatim.so
-            DESTINATION ${NOMINATIM_LIBDIR}/module)
-endif()
-
 install(FILES settings/env.defaults
               settings/address-levels.json
               settings/phrase-settings.json

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -61,8 +61,7 @@ pylint3 --extension-pkg-whitelist=osmium nominatim
 Before submitting a pull request make sure that the tests pass:
 
 ```
-  cd build
-  make test
+  make tests
 ```
 
 ## Releases

diff --git a/docs/admin/Advanced-Installations.md b/docs/admin/Advanced-Installations.md
@@ -131,76 +131,13 @@ script ([Geofabrik](https://download.geofabrik.de)) provides daily updates.
 
 ## Using an external PostgreSQL database
 
-You can install Nominatim using a database that runs on a different server when
-you have physical access to the file system on the other server. Nominatim
-uses a custom normalization library that needs to be made accessible to the
-PostgreSQL server. This section explains how to set up the normalization
-library.
-
-!!! note
-    The external module is only needed when using the legacy tokenizer.
-    If you have chosen the ICU tokenizer, then you can ignore this section
-    and follow the standard import documentation.
-
-### Option 1: Compiling the library on the database server
-
-The most sure way to get a working library is to compile it on the database
-server. From the prerequisites you need at least cmake, gcc and the
-PostgreSQL server package.
-
-Clone or unpack the Nominatim source code, enter the source directory and
-create and enter a build directory.
-
-```sh
-cd Nominatim
-mkdir build
-cd build
-```
-
-Now configure cmake to only build the PostgreSQL module and build it:
-
-```
-cmake -DBUILD_IMPORTER=off -DBUILD_API=off -DBUILD_TESTS=off -DBUILD_DOCS=off -DBUILD_OSM2PGSQL=off ..
-make
-```
-
-When done, you find the normalization library in `build/module/nominatim.so`.
-Copy it to a place where it is readable and executable by the PostgreSQL server
-process.
-
-### Option 2: Compiling the library on the import machine
-
-You can also compile the normalization library on the machine from where you
-run the import.
-
-!!! important
-    You can only do this when the database server and the import machine have
-    the same architecture and run the same version of Linux. Otherwise there is
-    no guarantee that the compiled library is compatible with the PostgreSQL
-    server running on the database server.
-
-Make sure that the PostgreSQL server package is installed on the machine
-**with the same version as on the database server**. You do not need to install
-the PostgreSQL server itself.
-
-Download and compile Nominatim as per standard instructions. Once done, you find
-the normalization library in `build/module/nominatim.so`. Copy the file to
-the database server at a location where it is readable and executable by the
-PostgreSQL server process.
-
-### Running the import
-
-On the client side you now need to configure the import to point to the
-correct location of the library **on the database server**. Add the following
-line to your your `.env` file:
-
-```
-NOMINATIM_DATABASE_MODULE_PATH="<directory on the database server where nominatim.so resides>"
-```
-
-Now change the `NOMINATIM_DATABASE_DSN` to point to your remote server and continue
-to follow the [standard instructions for importing](Import.md).
+You can install Nominatim using a database that runs on a different server.
+Simply point the configuration variable `NOMINATIM_DATABASE_DSN` to the
+server and follow the standard import documentation.
 
+The import will be faster, if the import is run directly from the database
+machine. You can easily switch to a different machine for the query frontend
+after the import.
 
 ## Moving the database to another machine
 
@@ -225,20 +162,9 @@ target machine.
     data updates but the resulting database is only about a third of the size
     of a full database.
 
-Next install Nominatim on the target machine by following the standard installation
-instructions. Again, make sure to use the same version as the source machine.
+Next install nominatim-api on the target machine by following the standard
+installation instructions. Again, make sure to use the same version as the
+source machine.
 
 Create a project directory on your destination machine and set up the `.env`
-file to match the configuration on the source machine. Finally run
-
-    nominatim refresh --website
-
-to make sure that the local installation of Nominatim will be used.
-
-If you are using the legacy tokenizer you might also have to switch to the
-PostgreSQL module that was compiled on your target machine. If you get errors
-that PostgreSQL cannot find or access `nominatim.so` then rerun
-
-    nominatim refresh --functions
-
-on the target machine to update the the location of the module.
+file to match the configuration on the source machine. That's all.
diff --git a/docs/admin/Installation.md b/docs/admin/Installation.md
@@ -178,18 +178,6 @@ make
 sudo make install
 ```
 
-!!! warning
-    The default installation no longer compiles the PostgreSQL module that
-    is needed for the legacy tokenizer from older Nominatim versions. If you
-    are upgrading an older database or want to run the
-    [legacy tokenizer](../customize/Tokenizers.md#legacy-tokenizer) for
-    some other reason, you need to enable the PostgreSQL module via
-    cmake: `cmake -DBUILD_MODULE=on ../Nominatim`. To compile the module
-    you need to have the server development headers for PostgreSQL installed.
-    On Ubuntu/Debian run: `sudo apt install postgresql-server-dev-<postgresql version>`
-    The legacy tokenizer is deprecated and will be removed in Nominatim 5.0
-
-
 Nominatim installs itself into `/usr/local` per default. To choose a different
 installation directory add `-DCMAKE_INSTALL_PREFIX=<install root>` to the
 cmake command. Make sure that the `bin` directory is available in your path

diff --git a/docs/customize/Settings.md b/docs/customize/Settings.md
@@ -64,26 +64,6 @@ Nominatim grants minimal rights to this user to all tables that are needed
 for running geocoding queries.
 
 
-#### NOMINATIM_DATABASE_MODULE_PATH
-
-| Summary            |                                                     |
-| --------------     | --------------------------------------------------- |
-| **Description:**   | Directory where to find the PostgreSQL server module |
-| **Format:**        | path |
-| **Default:**       | _empty_ (use `<project_directory>/module`) |
-| **After Changes:** | run `nominatim refresh --functions` |
-| **Comment:**       | Legacy tokenizer only |
-
-Defines the directory in which the PostgreSQL server module `nominatim.so`
-is stored. The directory and module must be accessible by the PostgreSQL
-server.
-
-For information on how to use this setting when working with external databases,
-see [Advanced Installations](../admin/Advanced-Installations.md).
-
-The option is only used by the Legacy tokenizer and ignored otherwise.
-
-
 #### NOMINATIM_TOKENIZER
 
 | Summary            |                                                     |
@@ -114,20 +94,6 @@ on the file format.
 If a relative path is given, then the file is searched first relative to the
 project directory and then in the global settings directory.
 
-#### NOMINATIM_MAX_WORD_FREQUENCY
-
-| Summary            |                                                     |
-| --------------     | --------------------------------------------------- |
-| **Description:**   | Number of occurrences before a word is considered frequent |
-| **Format:**        | int |
-| **Default:**       | 50000 |
-| **After Changes:** | cannot be changed after import |
-| **Comment:**       | Legacy tokenizer only |
-
-The word frequency count is used by the Legacy tokenizer to automatically
-identify _stop words_. Any partial term that occurs more often then what
-is defined in this setting, is effectively ignored during search.
-
 
 #### NOMINATIM_LIMIT_REINDEXING
 
@@ -162,25 +128,6 @@ codes, to restrict import to a subset of languages.
 Currently only affects the initial import of country names and special phrases.
 
 
-#### NOMINATIM_TERM_NORMALIZATION
-
-| Summary            |                                                     |
-| --------------     | --------------------------------------------------- |
-| **Description:**   | Rules for normalizing terms for comparisons |
-| **Format:**        | string: semicolon-separated list of ICU rules |
-| **Default:**       | :: NFD (); [[:Nonspacing Mark:] [:Cf:]] >;  :: lower (); [[:Punctuation:][:Space:]]+ > ' '; :: NFC (); |
-| **Comment:**       | Legacy tokenizer only |
-
-[Special phrases](Special-Phrases.md) have stricter matching requirements than
-normal search terms. They must appear exactly in the query after this term
-normalization has been applied.
-
-Only has an effect on the Legacy tokenizer. For the ICU tokenizer the rules
-defined in the
-[normalization section](Tokenizers.md#normalization-and-transliteration)
-will be used.
-
-
 #### NOMINATIM_USE_US_TIGER_DATA
 
 | Summary            |                                                     |

diff --git a/docs/customize/Tokenizers.md b/docs/customize/Tokenizers.md
@@ -15,53 +15,6 @@ they can be configured.
     chosen tokenizer is very limited as well. See the comments in each tokenizer
     section.
 
-## Legacy tokenizer
-
-!!! danger
-    The Legacy tokenizer is deprecated and will be removed in Nominatim 5.0.
-    If you still use a database with the legacy tokenizer, you must reimport
-    it using the ICU tokenizer below.
-
-The legacy tokenizer implements the analysis algorithms of older Nominatim
-versions. It uses a special Postgresql module to normalize names and queries.
-This tokenizer is automatically installed and used when upgrading an older
-database. It should not be used for new installations anymore.
-
-### Compiling the PostgreSQL module
-
-The tokeinzer needs a special C module for PostgreSQL which is not compiled
-by default. If you need the legacy tokenizer, compile Nominatim as follows:
-
-```
-mkdir build
-cd build
-cmake -DBUILD_MODULE=on
-make
-```
-
-### Enabling the tokenizer
-
-To enable the tokenizer add the following line to your project configuration:
-
-```
-NOMINATIM_TOKENIZER=legacy
-```
-
-The Postgresql module for the tokenizer is available in the `module` directory
-and also installed with the remainder of the software under
-`lib/nominatim/module/nominatim.so`. You can specify a custom location for
-the module with
-
-```
-NOMINATIM_DATABASE_MODULE_PATH=<path to directory where nominatim.so resides>
-```
-
-This is in particular useful when the database runs on a different server.
-See [Advanced installations](../admin/Advanced-Installations.md#using-an-external-postgresql-database) for details.
-
-There are no other configuration options for the legacy tokenizer. All
-normalization functions are hard-coded.
-
 ## ICU tokenizer
 
 The ICU tokenizer uses the [ICU library](http://site.icu-project.org/) to

diff --git a/docs/develop/Testing.md b/docs/develop/Testing.md
@@ -72,8 +72,6 @@ The tests can be configured with a set of environment variables (`behave -D key=
  * `DB_PORT` - (optional) port of database on host
  * `DB_USER` - (optional) username of database login
  * `DB_PASS` - (optional) password for database login
- * `SERVER_MODULE_PATH` - (optional) path on the Postgres server to Nominatim
-                          module shared library file (only needed for legacy tokenizer)
  * `REMOVE_TEMPLATE` - if true, the template and API database will not be reused
                        during the next run. Reusing the base templates speeds
                        up tests considerably but might lead to outdated errors