OPF file format for datasets
LibDEEP uses the same dataset format as LibOPF. The LibOPF package contains a directory, LibOPF/tools, with several useful tools for working with this format:
- txt2opf: a program to convert OPF files written in ASCII format to binary format.
- opf2txt: a program to convert OPF files written in binary format to ASCII format.
- opf_check: a program to check whether a file is in the required OPF format.
- opf2svm: a program to convert binary OPF files to LibSVM format.
- svm2opf: a program to convert LibSVM files to binary OPF format.
The original dataset and each of its parts (the training, evaluation, and test sets) must be in the following BINARY file format:
<# of samples> <# of labels> <# of features>
<0> <label> <feature 1 from element 0> <feature 2 from element 0> ...
<1> <label> <feature 1 from element 1> <feature 2 from element 1> ...
.
.
<i> <label> <feature 1 from element i> <feature 2 from element i> ...
<i+1> <label> <feature 1 from element i+1> <feature 2 from element i+1> ...
.
.
<n-1> <label> <feature 1 from element n-1> <feature 2 from element n-1> ...
The first number of each line, <0>, <1>, ... <n-1>, is the sample identifier (for a dataset of n samples). It is used when distances are precomputed, but it must be specified even when they are not. For unlabeled datasets (unsupervised OPF), use label 0 for all samples.
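The layout rules above can be checked on the ASCII form of a dataset with a short script. The following is a minimal sketch, not part of LibOPF: the function name `parse_opf_text` is hypothetical, and the label check assumes that classes are numbered 1 to the declared number of labels, with 0 reserved for unlabeled samples.

```python
def parse_opf_text(text):
    """Parse the ASCII form of an OPF dataset and validate its layout.

    Hypothetical helper for illustration; it is not part of LibOPF.
    Returns (n_samples, n_labels, n_features, samples), where each sample
    is a tuple (identifier, label, feature_list).
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty input")

    # Header: <# of samples> <# of labels> <# of features>
    n_samples, n_labels, n_features = (int(value) for value in rows[0])
    body = rows[1:]
    if len(body) != n_samples:
        raise ValueError(f"expected {n_samples} samples, found {len(body)}")

    samples = []
    seen_ids = set()
    for fields in body:
        # Each line: identifier, label, then exactly n_features values.
        if len(fields) != 2 + n_features:
            raise ValueError(f"wrong number of fields: {fields}")
        sample_id, label = int(fields[0]), int(fields[1])

        # The identifier is an integer from 0 to n-1 and is always required.
        if not 0 <= sample_id < n_samples or sample_id in seen_ids:
            raise ValueError(f"bad or repeated identifier: {sample_id}")
        seen_ids.add(sample_id)

        # Assumption: labels run from 1 to n_labels; 0 means unlabeled.
        if not 0 <= label <= n_labels:
            raise ValueError(f"label out of range: {label}")

        samples.append((sample_id, label, [float(v) for v in fields[2:]]))

    return n_samples, n_labels, n_features, samples
```

Calling `parse_opf_text` on a text with a repeated identifier or a missing feature value raises `ValueError`, which makes it usable as a quick sanity check before converting a file with txt2opf.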
Example: Suppose that you have a dataset with 5 samples distributed among 3 classes: 2 samples with label 1, 2 samples with label 2, and 1 sample with label 3. Each sample is represented by a feature vector of size 2. The OPF file should then look like this:
5 3 2
0 1 0.21 0.45
1 1 0.22 0.43
2 2 0.67 1.12
3 2 0.60 1.11
4 3 0.79 0.04
Comment #1: Note that the file must be binary, with no blank spaces between values. The ASCII representation shown here is for illustration only.
Comment #2: The first line of the file, 5 3 2, contains, respectively, the dataset size, the number of labels (classes), and the number of features in each feature vector. Each remaining line contains, for one sample, its identifier (an integer from 0 to n-1, where n is the dataset size), its label, and its feature values.
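To make the difference between the ASCII illustration and the binary file concrete, the example dataset can be packed into bytes as sketched below. This is a hedged sketch, not the LibOPF implementation: it assumes that every integer is stored as a 32-bit int and every feature as a 32-bit float, in native byte order, which this text does not state. Confirm the layout against the LibOPF sources, or by converting the result with opf2txt, before relying on it. The names `pack_opf`, `unpack_opf`, and `EXAMPLE` are hypothetical.

```python
import struct

# The 5-sample example: (identifier, label, features).
EXAMPLE = [
    (0, 1, [0.21, 0.45]),
    (1, 1, [0.22, 0.43]),
    (2, 2, [0.67, 1.12]),
    (3, 2, [0.60, 1.11]),
    (4, 3, [0.79, 0.04]),
]


def pack_opf(n_labels, samples):
    """Pack samples into bytes with no separators between values.

    Assumed layout (not confirmed by this text): 32-bit ints and
    32-bit floats in native byte order.
    """
    n_features = len(samples[0][2]) if samples else 0
    # Header: number of samples, number of labels, number of features.
    data = bytearray(struct.pack("=3i", len(samples), n_labels, n_features))
    for sample_id, label, features in samples:
        if len(features) != n_features:
            raise ValueError("all samples need the same number of features")
        # Record: identifier, label, then the feature values.
        data += struct.pack(f"=2i{n_features}f", sample_id, label, *features)
    return bytes(data)


def unpack_opf(data):
    """Read back bytes produced by pack_opf under the same assumptions."""
    n_samples, n_labels, n_features = struct.unpack_from("=3i", data, 0)
    record = struct.Struct(f"=2i{n_features}f")
    offset = struct.calcsize("=3i")
    samples = []
    for _ in range(n_samples):
        sample_id, label, *features = record.unpack_from(data, offset)
        samples.append((sample_id, label, features))
        offset += record.size
    return n_labels, samples
```

Under these assumptions, `pack_opf(3, EXAMPLE)` yields 92 bytes: 12 for the three header integers plus 16 for each of the 5 samples (two integers and two floats). Feature values read back with `unpack_opf` match the originals only to single precision.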