The IMS Open Corpus Workbench (CWB) is probably a major cause of the slowness of the Korp backend, so replacing it with faster software would be appealing. But what are the alternatives?
Even though replacing CWB is unlikely to become topical anytime soon, I think it would be good to know of and keep track of some alternatives, and maybe also of alternatives to Korp itself, even if that might sound like making the work on Korp pointless.
(Maybe at least some of the information below could be added as a wiki page to the korp-notes repository.)
Current CWB (3.5)
Version 3.5.0 of CWB was released in summer 2022. It is a stable version of the 3.4 development series that has been developed slowly over the last 11 years or so. According to the CWB documentation:
Version 3.5 is the current, and probably final, stable version of the "original" Corpus Workbench.
The future of CWB: CWB 4
The CWB developers have some quite impressive plans for CWB 4, which will break backward compatibility with CWB 3. However, given the pace of CWB development, I think it will most likely take many years before CWB 4 is usable and stable enough.
The CWB development model is not very open: as far as I can see, CWB is developed mainly by two developers, both university people, in their spare time. Even though that might be justified by the complexity (and perhaps obscurity) of the legacy code, I must admit that in general, I prefer Git and GitHub to Subversion and SourceForge, which are still used for CWB.
BlackLab
In the Korp developers’ meeting on 2023-01-17, @majsan mentioned BlackLab, developed at INL in the Netherlands. I hadn’t known of it before.
BlackLab is written in Java, built on Apache Lucene, and it comprises a library and a Web service. It would seem to be actively developed. It also seems to have a somewhat Korp-like frontend.
However, according to the documentation, it does not (yet) seem to support parallel corpora or regular-expression searches on structural (XML) attributes, although both are in the future plans.
NoSketch Engine
Another alternative might be NoSketch Engine.
However, I don’t quite like the fact that it’s a limited version of the commercial Sketch Engine (used in several CLARIN sites), and that NoSketch Engine has no documentation of its own but only refers to that of Sketch Engine. The open-source releases also seem to be distributed only as tarballs and RPMs, without version control access. I also recall reading in some documentation that it doesn’t currently scale up to much larger corpora than CWB does.
Others
Years ago, I was referred to Corpuscle, developed at Clarino Centre Bergen. I have now found the corpus site, but little or no information on the software itself or its general availability.