Releases: J535D165/recordlinkage
Version 0.9.0 (21 June 2017)
- A new index API. The new index API is no longer a single class (recordlinkage.Pairs(...)) with all the functionality in it. The new API is based on TensorFlow and FEBRL. With the new structure, it is easier to parallelise the record linkage process. In future releases, this will be implemented natively. See the reference page for more information and migration instructions: <http://recordlinkage.readthedocs.io/en/latest/ref-index.html>
- Significant speed improvement of the Sorted Neighbourhood Indexing algorithm. Thanks to @perryvais (PR #32).
- The function binary_comparisons is renamed to binary_vectors. Documentation added to RTD.
- Added unit tests to test the generation of random comparison vectors.
- Logging module added to separate module logs from user logs. The implementation is based on TensorFlow.
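Separating module logs from user logs is commonly done with a dedicated, named logger. The sketch below uses only the standard logging module to show the idea. The logger name "recordlinkage_sketch" and the helper functions are hypothetical and are not the package's actual implementation.

```python
import logging

# Hypothetical logger name; the package's real logger name is not stated here.
module_logger = logging.getLogger("recordlinkage_sketch")
module_logger.addHandler(logging.NullHandler())  # silent unless configured
module_logger.propagate = False  # keep module records out of the user's root logger


def set_module_verbosity(level):
    """Let the user choose how verbose the module logger is."""
    module_logger.setLevel(level)


def build_index(n_records):
    """Stand-in for library work that reports progress via the module logger."""
    module_logger.info("indexing %d records", n_records)
    return n_records * (n_records - 1) // 2
```

Because the module logger does not propagate, messages logged by user code through the root logger and messages from the module can be filtered and routed independently.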
Version 0.8.1 (27 Jan 2017)
- Issues solved with rendering docs on ReadTheDocs. Still not clear what is going on with the autodoc_mock_imports in the sphinx conf.py file. Maybe a bug in sphinx.
- Move six to dependencies.
- The reference part of the docs is split into separate subsections. This makes the reference more readable.
- The landing page of the docs is slightly changed.
Version 0.8.0 (22 Jan 2017)
- Add additional arguments to the function that downloads and loads the krebsregister data. The argument missing_values is used to fill missing values. Default: nothing is done. The argument shuffle is used to shuffle the records. Default is True.
- Remove the last traces of the old package name. The new package name is 'Python Record Linkage Toolkit'.
- Better error messages when only matches or only non-matches are passed to train the classifier.
- Add AirSpeedVelocity tests to test the performance.
- Compare for deduplication fixed. It was broken.
- Parameterized tests for the Compare class and its algorithms, making use of the nose-parameterized module.
- Update documentation about contributing.
- Bugfix/improvement when blocking on multiple columns with missing values.
- Fix bug #29. Package not working with pandas 0.18 and 0.17. Dropped support for pandas 0.17 and fixed support for 0.18. Also added multi-dependency tests for TravisCI.
- Support for dedicated deduplication algorithms.
- Special algorithm for full index in case of finding duplicates. Performance is 100x better.
- Function max_number_of_pairs to get the maximum number of pairs.
- low_memory for compare class.
- Improved performance in case of comparing a large number of record pairs.
- New documentation about custom algorithms
- New documentation about the use of classifiers.
- Possible to compare arrays and series directly without using labels.
- Make a dataframe with random comparison vectors with the binary_comparisons function in the recordlinkage.datasets.random module.
- Set KMeans cluster centers by hand.
- Various documentation updates and improvements.
- Jellyfish is now a required dependency. Fixes bug #30.
- Added tox.ini to test packaging and installation of package.
- Drop requirements.txt file.
- Many small fixes and changes. Most of the changes cover the Compare module. Especially label handling is improved.
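The maximum number of pairs and the deduplication-specific full index both follow from simple counting: linking two files of n and m records gives at most n*m candidate pairs, while deduplicating a single file of n records only needs the n*(n-1)/2 unordered pairs. A minimal sketch in plain Python; the function names are illustrative and are not the package's max_number_of_pairs implementation.

```python
from itertools import combinations, product


def max_pairs_linking(n_a, n_b):
    """Upper bound on candidate pairs when linking two files."""
    return n_a * n_b


def max_pairs_dedup(n):
    """Upper bound on candidate pairs when deduplicating one file."""
    return n * (n - 1) // 2


def full_index_linking(ids_a, ids_b):
    """Every record of A paired with every record of B."""
    return list(product(ids_a, ids_b))


def full_index_dedup(ids):
    """Each unordered pair once; no self-pairs and no mirrored pairs."""
    return list(combinations(ids, 2))
```

For a file of 5 records, the deduplication index holds 10 pairs, whereas pairing the file with itself naively would produce 25.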
Version 0.7.2 (9 Nov 2016)
v0.7.2 Bugfix in Levenshtein algorithms
Version 0.7.1 (9 Nov 2016)
v0.7.1 Improve importing workflow + dist bug fix
Version 0.6.0
This version includes the following updates:
- Reformatting the code such that it follows PEP8.
- Add Travis-CI and codecov support.
- Switch to distributing wheels.
- Fix bugs with deprecated pandas functions. __sub__ is no longer used for computing the difference of Index objects. It is now replaced by INDEX.difference(OTHER_INDEX).
- Exclude pairs with NaNs on the index-key in Q-gram indexing.
- Add tests for krebsregister dataset.
- Fix Python3 bug on krebsregister dataset.
- Improve unicode handling in phonetic encoding functions.
- Strip accents with the clean function.
- Add documentation
- Bug for random indexing with incorrect arguments fixed and tests added.
- Improved deployment workflow
- And much more
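Accent stripping of the kind mentioned in this release is often done by Unicode decomposition. The sketch below uses only the standard unicodedata module; strip_accents is a hypothetical helper and does not reproduce the package's clean function.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this helper, strip_accents("José Müller") gives "Jose Muller". Characters that have no decomposition, such as "ø", pass through unchanged.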
Version 0.5.0 (9 Sep 2016)
- Batch comparing added. Significant speed improvement.
- rldatasets are now included in the package itself.
- Added an experimental gender imputation tool.
- Blocking and SNI skip missing values
- No longer need for different index names
- FEBRL datasets included
- Unit tests for indexing and comparing improved
- Documentation updated
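Skipping missing values during blocking means that two records whose blocking key is missing in both are not treated as agreeing on that key. A minimal sketch in plain Python, assuming records are stored as dictionaries keyed by record id; the names is_missing and block_pairs are illustrative and not part of the package.

```python
import math
from collections import defaultdict


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def block_pairs(records_a, records_b, key):
    """Return (id_a, id_b) pairs whose `key` values are equal and not missing."""
    blocks_b = defaultdict(list)
    for id_b, rec in records_b.items():
        if not is_missing(rec.get(key)):
            blocks_b[rec[key]].append(id_b)

    pairs = []
    for id_a, rec in records_a.items():
        value = rec.get(key)
        if is_missing(value):
            continue  # a missing key never forms a candidate pair
        pairs.extend((id_a, id_b) for id_b in blocks_b.get(value, []))
    return pairs
```

Without the is_missing check, two records with None as their blocking key would compare equal and end up in the same block.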
Version 0.4.0 (20 Aug 2016)
- Fixes a serious bug with deduplication (thanks to https://github.com/dserban).
- Fixes undesired behaviour for sorted neighbourhood indexing with missing values.
- Add new datasets to the package like Febrl datasets
- Move Krebsregister dataset to this package.
- Improve and add some tests
- Various documentation updates
Version 0.3.1: Fix installation bug
v0.3.1 Fix problems with installing with pip
Version 0.3 (11 June 2016)
This version contains a lot of changes to the API. Hopefully, there are no large API changes needed for now.
- Total restructure of compare functions (the API changes are nearly finished).
- Compare method numerical is now named numeric and fuzzy is now named string.
- Add haversine formula to compare geographical records.
- Use numexpr for computing numeric comparisons.
- Add step, linear and squared comparing.
- Add eye index method.
- Improve, update and add new tests.
- Remove iterative indexing functions.
- New: add chunks for indexing functions. These chunks are defined in the class Pairs. If chunks are defined, the indexing functions return a generator that yields an Index for each chunk.
- Update documentation.
- Various bug fixes.
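The haversine formula gives the great-circle distance between two points specified by latitude and longitude. The sketch below is a plain-Python version for reference; the function name and the Earth radius constant are assumptions, and it does not reproduce the package's own implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

A distance computed this way can then be turned into a similarity score, for example with a step, linear or squared function of the distance.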