ml: basics
These pages will include tools, experiences, and tutorials on how to do simple ML on our corpora.
## Scope

openVirus will almost certainly be limited to working with words (i.e. not images, speech, etc.).
There is no magic.
ML requires good data, good conceptual models, good annotation, and constant testing and re-modelling. It may well require new code. The amount that can be done in a month is limited. Nonetheless, we can make useful discoveries in technology and, hopefully, some initial categorization.
Most of the operations will be classification. Wikipedia explains it well (https://en.wikipedia.org/wiki/Statistical_classification).
From Wikipedia:
Feature vectors

Most algorithms describe an individual instance whose category is to be predicted using a feature vector of individual, measurable properties of the instance. Each property is termed a feature, also known in statistics as an explanatory variable (or independent variable, although features may or may not be statistically independent). Features may variously be binary (e.g. "on" or "off"); categorical (e.g. "A", "B", "AB" or "O", for blood type); ordinal (e.g. "large", "medium" or "small"); integer-valued (e.g. the number of occurrences of a particular word in an email); or real-valued (e.g. a measurement of blood pressure). If the instance is an image, the feature values might correspond to the pixels of an image; if the instance is a piece of text, the feature values might be occurrence frequencies of different words. Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10).
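To make the idea of a feature vector for text concrete, here is a minimal Python sketch using only the standard library. It turns a piece of text into integer-valued features (word occurrence counts) and then discretizes them into the three groups mentioned in the quote. The vocabulary, the sample sentence, and the function names (`word_counts`, `feature_vector`, `discretize`) are assumptions made up for illustration; they are not part of any openVirus tool.

```python
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text, split it into alphabetic words, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def feature_vector(text, vocabulary):
    """Return one integer-valued feature per vocabulary word, in vocabulary order."""
    counts = word_counts(text)
    return [counts[word] for word in vocabulary]


def discretize(count):
    """Map an integer count to one of the three groups used in the quoted example."""
    if count < 5:
        return "less than 5"
    if count <= 10:
        return "between 5 and 10"
    return "greater than 10"


# Hypothetical vocabulary and text, chosen only for illustration.
vocabulary = ["virus", "infection", "vaccine"]
text = "The virus spreads by infection. A vaccine against the virus is in trials."

vector = feature_vector(text, vocabulary)
groups = [discretize(count) for count in vector]

print(vector)  # [2, 1, 1]
print(groups)
```

In practice the vocabulary would more likely come from a dictionary of domain terms than from a hand-written list, which is one reason good dictionaries matter before any ML is attempted.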
PMR: My experience is that it pays to have good feature extraction before the ML. That's why dictionaries and sectioning are so important. For example, if you do word frequency analysis on our current corpora, the commonest words in section titles are likely to be "Introduction", "Methods", etc. This would be useful to distinguish our corpus from sports reports ("Teams", "Fixtures"), but in science they occur in most papers, so there is a lot of "noise". However, it might help to distinguish articles from reviews.
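The point about noisy section titles can be sketched in a few lines of Python. The example counts how many papers contain each section title and keeps only the titles that occur in at most half of them. The three papers and their titles are invented for illustration, and the 0.5 threshold and the function names are assumptions rather than anything prescribed by the project.

```python
from collections import Counter

# Hypothetical section titles for three papers; real titles would come from
# a sectioned corpus.
papers = {
    "paper_a": ["Introduction", "Methods", "Results", "Discussion"],
    "paper_b": ["Introduction", "Methods", "Results", "Viral genome analysis"],
    "paper_c": ["Introduction", "Methods", "Transmission routes", "Discussion"],
}


def document_frequency(papers):
    """Count, for each normalized title, how many papers contain it at least once."""
    df = Counter()
    for titles in papers.values():
        # A set ensures a title repeated within one paper is counted once.
        df.update({title.strip().lower() for title in titles})
    return df


def distinctive_titles(papers, max_fraction=0.5):
    """Return titles that occur in at most max_fraction of the papers, sorted."""
    df = document_frequency(papers)
    limit = max_fraction * len(papers)
    return sorted(title for title, n in df.items() if n <= limit)


print(document_frequency(papers).most_common(2))
print(distinctive_titles(papers))
```

With this toy data, "introduction" and "methods" appear in every paper and are filtered out, while the two subject-specific titles survive. The same counts could be kept rather than discarded if the goal were to separate articles from reviews.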