Korp developers' meeting 2023 01 16
Jyrki Niemi edited this page Jan 17, 2023 · 1 revision
Participants:
- Språkbanken Text:
- Martin Hammarstedt
- Maria Öhrman
- Kielipankki (The Language Bank of Finland):
- Sam Hardwick (CSC)
- Helmiina Hotti (CSC)
- Anni Järvenpää (CSC)
- Kaisa Kuivalahti (CSC)
- Martin Matthiesen (CSC)
- Jyrki Niemi (University of Helsinki)
- Martin H: Mink will be released internally at the end of January.
- Martin M: Leif-Jöran Olsson has done some experiments with Elasticsearch replacing CWB.
- Fast with few attributes, but with the full set of attributes and as much data as in Språkbanken’s Korp, Elasticsearch is as slow as CWB.
- Maria: Språkbanken has taken a look at a corpus search tool called BlackLab.
- Somewhat faster than CWB but can only search from one corpus at a time.
- [Jyrki’s note afterwards: BlackLab seems to be actively developed, but based on the information on the website, it seems that it does not (yet) support parallel corpora nor regular expression searches for structural (XML) attributes.]
- Martin H. sped up the word picture significantly last autumn.
- Martin M: According to Leif-Jöran, if you have identical backends and put them behind a proxy it improves speed if they access data on different disks (RAID controllers).
- Martin H: That should work, but caching could be a problem.
- Martin H: If a single search were split up by corpus across different servers, it should speed up searches with many corpora.
- That was planned at some point at Språkbanken but it hasn’t been implemented.
- Martin H: Anyway, for a large number of simultaneous searches, Korp gets really slow.
- Anni: What about updating the data and the caches getting out of date?
- Anni: Some slowness is due to issues other than many simultaneous users.
- Martin M: Does Språkbanken have plans of this kind?
- Martin H: No, as there are currently no spare servers.
- Sam: Is there documentation on what subset of CQP Korp uses?
- Martin H: All the regular query features, but only for regular KWIC and statistics data.
- Sam: Sharding the corpus data might help.
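The idea of splitting one search by corpus across several servers could be sketched roughly as follows. This is only an illustration under stated assumptions, not how Korp works today: `partition`, `fan_out_search` and `fake_backend_search` are hypothetical names, and the fake backend stands in for an HTTP request to one of several identical backend instances.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(corpora, n_shards):
    """Distribute corpus IDs round-robin over at most n_shards groups."""
    shards = [[] for _ in range(n_shards)]
    for i, corpus in enumerate(corpora):
        shards[i % n_shards].append(corpus)
    # Drop empty groups (fewer corpora than shards).
    return [shard for shard in shards if shard]


def fake_backend_search(corpora, query):
    """Hypothetical stand-in for one backend instance.

    A real implementation would send an HTTP request to a backend
    server here; this stub only fabricates one placeholder hit per corpus.
    """
    return {corpus: [f"{corpus}: hit for {query}"] for corpus in corpora}


def fan_out_search(corpora, query, n_shards, search=fake_backend_search):
    """Run one search per shard concurrently and merge the partial results."""
    if not corpora:
        return {}
    shards = partition(corpora, n_shards)
    merged = {}
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for partial in pool.map(lambda shard: search(shard, query), shards):
            merged.update(partial)
    # Present the results in the order in which the corpora were requested.
    return {corpus: merged[corpus] for corpus in corpora}


if __name__ == "__main__":
    print(fan_out_search(["A", "B", "C", "D", "E"], '[word = "x"]', n_shards=2))
```

The sketch ignores the caching and data-update concerns raised above; a real fan-out would also need to decide how cached partial results are invalidated when a corpus is updated on one server.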
- Maria: GitHub discussions are now enabled in the frontend and backend repositories.
- Not much discussion yet.
- A part-time developer will substitute for Maria at least through next autumn.
- It is hoped that he will continue with Korp after Maria is back.
- Martin M: We would need to transfer knowledge to the newer developers, which might require more collaboration.
- Martin M: Korp developers’ meetings of this kind could take place roughly once a quarter.
- Jyrki: Should we also invite other parties to meetings of this kind, as e.g. the Icelanders seem active now?
- Maria: Yes, the Icelanders and someone from Denmark.
- Martin M: What about a Korp SIG in CLARIN?
- It could have a mailing list on which you could announce meetings of this kind to interested people.
- Martin M. intends to propose this in the next CLARIN Centre Committee meeting.
- Martin H. and Maria: Sounds good.
- Maria: The Korp frontend is currently using AngularJS which is end-of-life.
- Maria: Korp will not be moved out of AngularJS quite yet, but the goal is to move to Vue.js 3.
- Maria: Refactoring everything into components will reduce dependencies, as the components will have clearly defined inputs and outputs.
- Maria has recently made some changes which should make it easier to change the framework (branch `refactor`).
- Changes in the result tabs in particular.
- Maria hopes to have the changes integrated into `dev` this week.
- Maria: In general, pull requests should be based on the `dev` branch.
- Maria will also add some code for Shibboleth login.
- Martin M: We should use Språkbanken’s Shibboleth code.
- Maria: Yes, if the functionality is the same.
- Martin M: We also use a database for users who have applied for access to restricted corpora.
- Maria: Språkbanken also uses a database but probably very differently.
- Språkbanken is using JWT, to which credentials are added from the db.
- Subcorpora are currently handled manually.
- Martin M: The newest version of our rights management system only supports OpenID Connect.
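To make the "JWT with credentials added from the database" idea concrete, here is a minimal sketch using only the Python standard library. It is not Språkbanken's implementation: the claim name `corpora`, the dictionary standing in for the user database, the shared-secret HS256 signing and the function names are all assumptions made for illustration. A production setup would more likely use a maintained JWT library, asymmetric keys and an expiry claim.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(username, access_db, secret):
    """Build an HS256 JWT whose claims include the user's corpus rights.

    access_db is a hypothetical stand-in for the access-rights database:
    a dict mapping user names to lists of corpus IDs.
    """
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": username, "corpora": access_db.get(username, [])}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def verify_token(token, secret):
    """Return the claims if the signature is valid, else raise ValueError."""
    signing_input, _, signature = token.rpartition(".")
    if not hmac.compare_digest(_b64url_decode(signature), _sign(signing_input, secret)):
        raise ValueError("invalid token signature")
    return json.loads(_b64url_decode(signing_input.split(".")[1]))
```

With this sketch, a backend would call `verify_token` on each request and compare the requested corpora against the `corpora` claim instead of querying the database again.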
- Maria: No short-term changes are expected, but eventually some changes should be made.
- Maria: E.g. `DisplayType` with only a single supported value is not good practice.
- Martin M: Versioning the configuration would be a good idea.
- Jyrki has some enhancements or additions for Kielipankki.
- They should be backwards-compatible, so Jyrki will eventually create pull requests for them.
- Maria: Maybe we could use GitHub discussions to discuss the changes first.
- Jyrki: What is the priority of Korp at Språkbanken now?
- How much time do Maria and Martin H. have for Korp?
- Martin H. will only have 10 % of time for Korp this spring.
- However, 40 % of time for Mink, so if Mink needs something in Korp, it can be developed.
- Maria has 75 % of her time for Korp development this spring.
- Maria: The backend is fairly stable with few bugs.
- Martin H: Historically, the frontend has been neglected.
- The backend has features that the frontend does not use.
- Martin M: The missing scalability of the backend is a problem.
- Martin M: We have some resources in Finland now (at CSC).
- We could perhaps implement some features and Språkbanken could play the role of a benevolent dictator and say if they can be accepted or not.
- We could e.g. provide the data sharding.
- Maria: Probably no one has done anything about testing.
- Martin M: Anni, Kaisa and Helmiina have at least some experience with pytest.
- Anni: Least familiar with frontend testing.
- Anni: If new people will be developing, it would be great to have some tests.
- Martin M: Some lightweight frontend testing would be good.
- Martin M: Backend testing would be even more important.
- Sam: Adding unit tests later may be difficult, but adding functional tests is not that hard.
- Martin M: Not hard, but needs a considerable amount of time.
- Sam/Martin M: Start from tests for known broken and new functionality.
- Anni: Some kind of black-box testing would be something to start from.
- For existing and new functionality.
- Martin M: Would we agree to start with testing the backend?
- Martin H: Yes. Start from a single test and figure out how that works, then add tests when there is time for that.
- Martin M: What about using Apache Airflow for testing?
- Anni: Would not be needed unless we need to test continuously.
- Martin M: It is also possible that the corpora change slightly.
- Jyrki: That is not related to Korp code.
- Maria might also work on the frontend tests.
- The frontend is using the Protractor testing framework for AngularJS.
- [Jyrki’s note afterwards: According to the Protractor website, “Protractor is deprecated and will reach end-of-life in the summer of 2023”.]
- Maria has taken a look at some other frameworks, too.
- Martin M: Based on previous experience, if the frontend tests are too detailed, you may end up fixing the tests when the browser is updated.
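A first black-box test of the backend, in the spirit of "start from a single test", might look like the sketch below. It uses the standard-library `unittest` module so that it needs no extra dependencies; pytest can also collect and run `unittest.TestCase` classes. The environment variables `KORP_BACKEND_URL` and `KORP_TEST_CORPUS` are hypothetical names, and the `/query` parameters (`corpus`, `cqp`, `start`, `end`) and response keys (`hits`, `kwic`, `ERROR`) reflect an assumed reading of the Korp API that should be checked against the backend documentation.

```python
import json
import os
import unittest
import urllib.parse
import urllib.request

# Hypothetical configuration: the backend under test and a small test corpus.
BACKEND_URL = os.environ.get("KORP_BACKEND_URL")
TEST_CORPUS = os.environ.get("KORP_TEST_CORPUS", "TESTCORPUS")


def fetch_json(base_url, endpoint, params):
    """Send a GET request to the backend and decode the JSON response."""
    query = urllib.parse.urlencode(params)
    url = f"{base_url.rstrip('/')}/{endpoint}?{query}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def query_problems(payload, max_rows):
    """Return a list of problems found in a query response (empty if none).

    Only the overall shape is checked, not the corpus content, so the
    test does not break when the corpus data changes slightly.
    """
    if "ERROR" in payload:
        return [f"backend reported an error: {payload['ERROR']}"]
    problems = []
    hits = payload.get("hits")
    kwic = payload.get("kwic")
    if not isinstance(hits, int) or hits < 0:
        problems.append("'hits' is not a non-negative integer")
    if not isinstance(kwic, list):
        problems.append("'kwic' is not a list")
    elif len(kwic) > max_rows:
        problems.append(f"'kwic' has more than {max_rows} rows")
    return problems


@unittest.skipUnless(BACKEND_URL, "set KORP_BACKEND_URL to run black-box tests")
class QueryBlackBoxTest(unittest.TestCase):
    def test_simple_query_has_expected_shape(self):
        payload = fetch_json(
            BACKEND_URL,
            "query",
            {"corpus": TEST_CORPUS, "cqp": '[word = ".*"]', "start": 0, "end": 9},
        )
        self.assertEqual(query_problems(payload, max_rows=10), [])


if __name__ == "__main__":
    unittest.main()
```

Keeping the shape check in a plain function separate from the HTTP call means the check itself can be tested without a running backend, and further endpoints can be covered by adding similar functions.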
- Martin M. talked about the CLARIN Federated Content Search (FCS) for Korp in Kielipankki.
- Leif-Jöran Olsson visited CSC last week and we got a proof-of-concept FCS for Kielipankki Korp working.
- The FCS Korp endpoint reference implementation is somewhat out of date and should be made easier to configure.
- Martin M: Kaisa is our Java guru, so she might implement the production version.
- Maria: Språkbanken now has a backend with JWT authentication.
- However, JWT tokens are difficult to get.
- Is there a solution in Finland?
- Martin M: We would also have backend users for restricted corpora, but we have not implemented it yet.
- OpenID Connect (OIDC) could be a solution.
- We currently have a proxy that translates between OIDC and Shibboleth.
- Maria: Would it be easy to allow users to log in with Google or Facebook?
- Martin M: If you want that, you get it for free, but we don’t want it, as ACA corpora need the attributes from a university or similar, and for RES corpora a university email is easier to verify.
- Maria: Sometimes you would like to invite someone, so verification is not a problem.
- Martin M: It’s the Finnish team’s turn to organize the meeting.
- It was agreed that the meeting would be in April 2023, after Easter.