Data Explorer is a tool for performing various operations on datasets. It supports operations like aggregation (average, minimum, and maximum) and grouping, providing a simple interface for querying data.
Clone the repository and use CMake with GCC, Clang, or MSVC to compile the project and tests from an IDE or the command line. CMake should:
- configure everything automatically,
- download and build dependencies,
- compile and create binaries.
Tool | Windows | Ubuntu |
---|---|---|
OS version | 10 22H2 | 24.04 |
GCC | 13.1.0 | 13.2.0 |
CMake | 3.30.2 | 3.28.3 |
Git | 2.46.0 | 2.43.0 |
cpputils | 1.0.0 | 1.0.0 |
GoogleTest | 1.15.2 | 1.15.2 |
Launch the binary with a single parameter, the name of the file containing the data:
$ data-explorer sample.txt
The application expects a series of commands as text lines. Each line should have the following structure:
<operation> <aggregation column> <grouping column>
where:
- operation is an aggregation function of one of the following types:
  - avg: calculates the average of a set of values.
  - min: finds the minimum value in a set of values.
  - max: finds the maximum value in a set of values.
- aggregation column is the numerical data column on which the aggregation is performed.
- grouping column is the column used for grouping results.
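The command structure described above can be illustrated with a small parser. This is only a sketch, not the project's actual code: the names Operation, Query, and parseQuery are hypothetical, and matching operation names in lowercase only is an assumption.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical types for illustration; the project's real classes may differ.
enum class Operation { Avg, Min, Max };

struct Query {
    Operation operation;
    std::string aggregationColumn;
    std::string groupingColumn;
};

// Parses "<operation> <aggregation column> <grouping column>".
// Returns std::nullopt when the line does not contain exactly three
// whitespace-separated tokens or when the operation name is unknown.
std::optional<Query> parseQuery(const std::string& line) {
    std::istringstream input{line};
    std::string operationName;
    std::string aggregationColumn;
    std::string groupingColumn;
    std::string extra;

    if (!(input >> operationName >> aggregationColumn >> groupingColumn))
        return std::nullopt;  // fewer than three tokens
    if (input >> extra)
        return std::nullopt;  // more than three tokens

    Operation operation{};
    if (operationName == "avg")
        operation = Operation::Avg;
    else if (operationName == "min")
        operation = Operation::Min;
    else if (operationName == "max")
        operation = Operation::Max;
    else
        return std::nullopt;  // unsupported operation (assumed case-sensitive)

    return Query{operation, aggregationColumn, groupingColumn};
}
```

With this sketch, parseQuery("avg score movie_name") yields a Query holding Operation::Avg, score, and movie_name, while a line with a missing column or an unknown operation yields std::nullopt.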
Usage Example:
$ avg score movie_name
Example output:
Execute: AVG score GROUPED BY movie_name
Execution time: 12us
Results:
ender's_game 8
pulp_fiction 6
inception 8
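A grouped average like the one in the example can be computed in a single pass by keeping a running sum and count per group. The following is a minimal sketch under stated assumptions, not the project's implementation: Accumulator and averageByGroup are hypothetical names, values are assumed to be integers, and the truncating integer division is a guess based on the whole numbers in the example output.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: running sum and row count for one group.
struct Accumulator {
    long long sum{0};
    long long count{0};
};

// Computes the average of `values` for each distinct entry in `groups`.
// Both vectors describe the same rows; on a length mismatch an empty
// result is returned. Integer division truncates (an assumption).
std::unordered_map<std::string, long long> averageByGroup(
    const std::vector<std::string>& groups,
    const std::vector<long long>& values) {
    if (groups.size() != values.size())
        return {};

    // Single pass: accumulate sum and count per group.
    std::unordered_map<std::string, Accumulator> accumulators;
    for (std::size_t row = 0; row < groups.size(); ++row) {
        Accumulator& accumulator = accumulators[groups[row]];
        accumulator.sum += values[row];
        ++accumulator.count;
    }

    // Derive the averages; count is at least 1 for every stored group.
    std::unordered_map<std::string, long long> averages;
    averages.reserve(accumulators.size());
    for (const auto& [group, accumulator] : accumulators)
        averages.emplace(group, accumulator.sum / accumulator.count);
    return averages;
}
```

The min and max operations fit the same loop by replacing the sum and count with a running minimum or maximum.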
This project is licensed under the MIT License. See the LICENSE file for details.
The project uses the following open-source software:
Name | License | Home | Description |
---|---|---|---|
cpputils | MIT | https://github.com/przemek83/cpputils | collection of C++ utility classes |
GoogleTest | BSD-3-Clause | https://github.com/google/googletest | testing framework |
For testing purposes, the gtest framework is used. Build the project first and make sure that the data-explorer-test target is built. Modern IDEs supporting CMake also support running tests with monitoring of failures. To run the tests manually, go to the build/test directory, where the binary data-explorer-test should be available. Launching it should produce the following output on Linux:
$ ./data-explorer-test
Running main() from <path>/data-explorer/build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 44 tests from 8 test suites.
[----------] Global test environment set-up.
[----------] 1 test from DataExplorer
[ RUN ] DataExplorer.executeQuery
[ OK ] DataExplorer.executeQuery (0 ms)
(...)
[ RUN ] UserInterfaceTest.GetQueryOperationWrongAggregatingColumn
[ OK ] UserInterfaceTest.GetQueryOperationWrongAggregatingColumn (0 ms)
[----------] 7 tests from UserInterfaceTest (0 ms total)
[----------] Global test environment tear-down
[==========] 44 tests from 8 test suites ran. (1 ms total)
[ PASSED ] 44 tests.
As an alternative, CTest can be used to run tests from the build directory:
$ ctest
Test project <path>/data-explorer/build
Start 1: DataExplorer.executeQuery
1/44 Test #1: DataExplorer.executeQuery ................................... Passed 0.00 sec
(...)
Start 44: UserInterfaceTest.GetQueryOperationWrongAggregatingColumn
44/44 Test #44: UserInterfaceTest.GetQueryOperationWrongAggregatingColumn ... Passed 0.00 sec
100% tests passed, 0 tests failed out of 44
Total Test time (real) = 0.10 sec
As speed is the most important expectation from the task, some optimizations were performed. The ones with the biggest impact:
- Used std::unordered_map instead of std::map.
- Used std::vector to store data and passed it by const reference.
- Stored strings as mapped values (std::string <-> unsigned int) and used indexes for operations (a performance and significant memory optimization).
- Minimized copying.
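The string mapping from the list above could look roughly like the sketch below. StringDictionary and its member names are hypothetical and not taken from the project; the idea shown is only that each distinct string is stored once and rows hold a compact unsigned int index instead of a copy of the text.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class: maps each distinct string to a small integer id and back.
class StringDictionary {
public:
    // Returns the id of `text`, assigning the next free id on first sight.
    unsigned int idOf(const std::string& text) {
        const auto nextId = static_cast<unsigned int>(texts_.size());
        const auto [position, inserted] = ids_.try_emplace(text, nextId);
        if (inserted)
            texts_.push_back(text);
        return position->second;
    }

    // Returns the string stored under `id`; throws std::out_of_range
    // if `id` was never returned by idOf().
    const std::string& textOf(unsigned int id) const { return texts_.at(id); }

    // Number of distinct strings seen so far.
    std::size_t size() const { return texts_.size(); }

private:
    std::unordered_map<std::string, unsigned int> ids_;  // string -> id
    std::vector<std::string> texts_;                     // id -> string
};
```

Because the ids are dense and start at zero, a grouping column stored this way can index directly into a std::vector of accumulators, which avoids hashing strings during aggregation.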
For bigger datasets and more sophisticated operations, the following enhancements might be viable:
- Usage of multithreading in calculations using std::async + std::future.
- Usage of MPI (makes sense with more sophisticated calculations).
- GPU calculations (in case of more complex calculations).
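To give an idea of the std::async + std::future option, the sketch below splits a max calculation into chunks, scans each chunk in its own task, and merges the partial results. It is an illustration of the technique only: parallelMax and its taskCount parameter are hypothetical, and the project does not currently contain such code.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <stdexcept>
#include <vector>

// Finds the maximum of `values` by scanning up to `taskCount` chunks
// concurrently. Throws std::invalid_argument for an empty input.
int parallelMax(const std::vector<int>& values, std::size_t taskCount) {
    if (values.empty())
        throw std::invalid_argument("values must not be empty");

    // Use at least one task and never more tasks than elements.
    taskCount = std::max<std::size_t>(1, std::min(taskCount, values.size()));
    const std::size_t chunkSize{(values.size() + taskCount - 1) / taskCount};

    // Launch one asynchronous task per chunk.
    std::vector<std::future<int>> partialResults;
    for (std::size_t begin = 0; begin < values.size(); begin += chunkSize) {
        const std::size_t end{std::min(begin + chunkSize, values.size())};
        partialResults.push_back(
            std::async(std::launch::async, [&values, begin, end] {
                return *std::max_element(values.data() + begin,
                                         values.data() + end);
            }));
    }

    // Merge: get() blocks until the corresponding task has finished.
    int result{partialResults.front().get()};
    for (std::size_t i = 1; i < partialResults.size(); ++i)
        result = std::max(result, partialResults[i].get());
    return result;
}
```

Starting a task has a cost, so an approach like this is likely to pay off only for large inputs. Some toolchains may also require linking with a threading library (for example -pthread with GCC).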