This project contains pre-trained Machine Learning models which can be used with the Datumbox Machine Learning Framework v0.8.3-SNAPSHOT (Build 20201014).
Copyright (c) 2013-2020 Vasilis Vryniotis.
Licensed under the Apache License, Version 2.0.
The project contains the binary files of all the text classification models which are available via the Datumbox API:
- Sentiment Analysis: The Sentiment Analysis model classifies documents as positive, negative or neutral (lack of sentiment) depending on whether they express a positive, negative or neutral opinion.
- Twitter Sentiment Analysis: The Twitter Sentiment Analysis model allows you to perform Sentiment Analysis on Twitter. It classifies the tweets as positive, negative or neutral depending on their context.
- Subjectivity Analysis: The Subjectivity Analysis model categorizes documents as subjective or objective based on their writing style. Texts that express personal opinions are labeled as subjective and the others as objective.
- Topic Classification: The Topic Classification model assigns documents in 12 thematic categories based on their keywords, idioms and jargon. It can be used to identify the topic of the texts.
- Spam Detection: The Spam Detection model labels documents as spam or nospam by taking into account their context. It can be used to filter out spam emails and comments.
- Adult Content Detection: The Adult Content Detection model classifies the documents as adult or no-adult based on their context. It can be used to detect whether a document contains content unsuitable for minors.
- Language Detection: The Language Detection model identifies the natural language of the given document based on its words and context. This classifier is able to detect 96 different languages.
- Commercial Detection: The Commercial Detection model labels the documents as commercial or non-commercial based on their keywords and expressions. It can be used to detect whether a website is commercial or not.
- Educational Detection: The Educational Detection model classifies the documents as educational or non-educational based on their context. It can be used to detect whether a website is educational or not.
Important Notes:
- The models support only English.
- The binary files should be loaded using their corresponding Framework version.
- All the models should be loaded using the InMemory storage engine.
- Within the folder of each model you will find a stats.txt file which contains the accuracy metrics of the classifier. The metrics were estimated using 10-fold cross validation.
- All the remaining API methods which are not included here (Readability Assessment, Keyword Extraction, Text Extraction & Document Similarity) are directly powered up by standalone classes of the framework.
- Download/clone this project locally.
- Open your datumbox.configuration.properties file and make sure you use the InMemory engine by default:
configuration.storageConfiguration=com.datumbox.framework.storage.inmemory.InMemoryConfiguration
- Open your datumbox.inmemoryconfiguration.properties file and update the directory:
inMemoryConfiguration.directory=/path/to/datumbox-framework-zoo
- Within your project initialize the classifiers using their name:
Configuration configuration = Configuration.getConfiguration(); TextClassifier textClassifier = MLBuilder.load(TextClassifier.class, "SentimentAnalysis", configuration); System.out.println(textClassifier.predict("Datumbox is amazing!").getYPredicted());
Note that it is also possible to skip steps 2 & 3 and instead programmatically update the configuration object before initializing the classifier:
Configuration configuration = Configuration.getConfiguration();
InMemoryConfiguration storageConfiguration = new InMemoryConfiguration();
storageConfiguration.setDirectory("/path/to/datumbox-framework-zoo");
configuration.setStorageConfiguration(storageConfiguration);