MINIFICPP-2465 SplitContent
martinzink committed Oct 3, 2024
1 parent 0359e1b commit bcd607b
Showing 8 changed files with 739 additions and 7 deletions.
35 changes: 35 additions & 0 deletions PROCESSORS.md
@@ -94,6 +94,7 @@ limitations under the License.
- [RetryFlowFile](#RetryFlowFile)
- [RouteOnAttribute](#RouteOnAttribute)
- [RouteText](#RouteText)
- [SplitContent](#SplitContent)
- [SplitText](#SplitText)
- [TailEventLog](#TailEventLog)
- [TailFile](#TailFile)
@@ -2833,6 +2834,40 @@ In the list below, the names of required properties appear in bold. Any other pr
| success | All files, containing log events, are routed to success |


## SplitContent

### Description

Splits incoming FlowFiles by a specified byte sequence.

### Properties

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

| Name | Default Value | Allowable Values | Description |
|----------------------------|---------------|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Byte Sequence Format** | Hexadecimal | Hexadecimal<br/>Text | Specifies how the <Byte Sequence> property should be interpreted |
| **Byte Sequence** | | | A representation of bytes to look for and upon which to split the source file into separate files |
| **Keep Byte Sequence** | false | true<br/>false | Determines whether or not the Byte Sequence should be included with each Split |
| **Byte Sequence Location** | Trailing | Trailing<br/>Leading | If <Keep Byte Sequence> is set to true, specifies whether the byte sequence should be added to the end of the first split or the beginning of the second; if <Keep Byte Sequence> is false, this property is ignored. |
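
How `Keep Byte Sequence` and `Byte Sequence Location` interact can be sketched with a small standalone function. This is only an illustration under stated assumptions, not the processor's implementation: `splitOnSequence` and the `Location` enum are hypothetical names, the content is treated as an in-memory string rather than a stream, and dropping empty splits is an assumption of the sketch.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for the "Byte Sequence Location" property.
enum class Location { Trailing, Leading };

// Hypothetical helper (not part of MiNiFi): splits `content` on every
// occurrence of `sequence`. When `keep` is true, the sequence is attached to
// the end of the preceding split (Trailing) or to the start of the following
// split (Leading). Empty splits are dropped, which is an assumption here.
std::vector<std::string> splitOnSequence(std::string_view content,
                                         std::string_view sequence,
                                         bool keep,
                                         Location location) {
  std::vector<std::string> splits;
  if (sequence.empty()) {
    splits.emplace_back(content);
    return splits;
  }
  std::size_t start = 0;
  bool prepend = false;  // true when the next split must start with the sequence
  while (true) {
    const std::size_t pos = content.find(sequence, start);
    const bool found = pos != std::string_view::npos;
    std::string split;
    if (prepend) {
      split += sequence;
    }
    split += content.substr(start, found ? pos - start : std::string_view::npos);
    if (found && keep && location == Location::Trailing) {
      split += sequence;
    }
    if (!split.empty()) {
      splits.push_back(std::move(split));
    }
    if (!found) {
      break;
    }
    prepend = keep && location == Location::Leading;
    start = pos + sequence.size();
  }
  return splits;
}
```

With the content `a,b,c` and the sequence `,`, the sketch yields `a`, `b`, `c` when the sequence is not kept, `a,`, `b,`, `c` with `Trailing`, and `a`, `,b`, `,c` with `Leading`.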

### Relationships

| Name | Description |
|----------|------------------------------------------------------|
| original | The original file |
| splits | All Splits will be routed to the splits relationship |

### Output Attributes

| Attribute | Relationship | Description |
|---------------------------|--------------|--------------------------------------------------------------------------------------------------------------------------------|
| fragment.identifier | | All split FlowFiles produced from the same parent FlowFile will have the same randomly generated UUID added for this attribute |
| fragment.index | | A one-based index that indicates the order of the split FlowFiles created from a single parent FlowFile |
| fragment.count | | The number of split FlowFiles generated from the parent FlowFile |
| segment.original.filename | | The filename of the parent FlowFile |


## SplitText

### Description
6 changes: 3 additions & 3 deletions README.md
@@ -63,9 +63,9 @@ MiNiFi - C++ supports the following C++ processors:

The following table lists the base set of processors.

| Extension Set | Processors |
|---------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Base** | [AppendHostInfo](PROCESSORS.md#appendhostinfo)<br/>[AttributesToJSON](PROCESSORS.md#attributestojson)<br/>[DefragmentText](PROCESSORS.md#defragmenttext)<br/>[ExecuteProcess](PROCESSORS.md#executeprocess)<br/>[ExtractText](PROCESSORS.md#extracttext)<br/>[FetchFile](PROCESSORS.md#fetchfile)<br/>[GenerateFlowFile](PROCESSORS.md#generateflowfile)<br/>[GetFile](PROCESSORS.md#getfile)<br/>[GetTCP](PROCESSORS.md#gettcp)<br/>[HashContent](PROCESSORS.md#hashcontent)<br/>[InvokeHTTP](PROCESSORS.md#invokehttp)<br/>[ListenSyslog](PROCESSORS.md#listensyslog)<br/>[ListenTCP](PROCESSORS.md#listentcp)<br/>[ListenUDP](PROCESSORS.md#listenudp)<br/>[ListFile](PROCESSORS.md#listfile)<br/>[LogAttribute](PROCESSORS.md#logattribute)<br/>[PutFile](PROCESSORS.md#putfile)<br/>[PutTCP](PROCESSORS.md#puttcp)<br/>[PutUDP](PROCESSORS.md#putudp)<br/>[ReplaceText](PROCESSORS.md#replacetext)<br/>[RetryFlowFile](PROCESSORS.md#retryflowfile)<br/>[RouteOnAttribute](PROCESSORS.md#routeonattribute)<br/>[RouteText](PROCESSORS.md#routetext)<br/>[SplitText](PROCESSORS.md#splittext)<br/>[TailFile](PROCESSORS.md#tailfile)<br/>[UpdateAttribute](PROCESSORS.md#updateattribute) |
| Extension Set | Processors |
|---------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Base** | [AppendHostInfo](PROCESSORS.md#appendhostinfo)<br/>[AttributesToJSON](PROCESSORS.md#attributestojson)<br/>[DefragmentText](PROCESSORS.md#defragmenttext)<br/>[ExecuteProcess](PROCESSORS.md#executeprocess)<br/>[ExtractText](PROCESSORS.md#extracttext)<br/>[FetchFile](PROCESSORS.md#fetchfile)<br/>[GenerateFlowFile](PROCESSORS.md#generateflowfile)<br/>[GetFile](PROCESSORS.md#getfile)<br/>[GetTCP](PROCESSORS.md#gettcp)<br/>[HashContent](PROCESSORS.md#hashcontent)<br/>[InvokeHTTP](PROCESSORS.md#invokehttp)<br/>[ListenSyslog](PROCESSORS.md#listensyslog)<br/>[ListenTCP](PROCESSORS.md#listentcp)<br/>[ListenUDP](PROCESSORS.md#listenudp)<br/>[ListFile](PROCESSORS.md#listfile)<br/>[LogAttribute](PROCESSORS.md#logattribute)<br/>[PutFile](PROCESSORS.md#putfile)<br/>[PutTCP](PROCESSORS.md#puttcp)<br/>[PutUDP](PROCESSORS.md#putudp)<br/>[ReplaceText](PROCESSORS.md#replacetext)<br/>[RetryFlowFile](PROCESSORS.md#retryflowfile)<br/>[RouteOnAttribute](PROCESSORS.md#routeonattribute)<br/>[RouteText](PROCESSORS.md#routetext)<br/>[SplitContent](PROCESSORS.md#splitcontent)<br/>[SplitText](PROCESSORS.md#splittext)<br/>[TailFile](PROCESSORS.md#tailfile)<br/>[UpdateAttribute](PROCESSORS.md#updateattribute) |

The next table outlines CMAKE flags that correspond with MiNiFi extensions. Extensions that are enabled by default ( such as RocksDB ), can be disabled with the respective CMAKE flag on the command line.

168 changes: 168 additions & 0 deletions extensions/standard-processors/processors/SplitContent.cpp
@@ -0,0 +1,168 @@
/**
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "SplitContent.h"

#include <range/v3/view/split.hpp>

#include "core/ProcessContext.h"
#include "core/ProcessSession.h"
#include "core/Resource.h"
#include "core/FlowFile.h"
#include "utils/gsl.h"
#include "utils/ProcessorConfigUtils.h"

namespace org::apache::nifi::minifi::processors {

constexpr size_t BUFFER_TARGET_SIZE = 1024;

void SplitContent::initialize() {
  setSupportedProperties(Properties);
  setSupportedRelationships(Relationships);
}

void SplitContent::onSchedule(core::ProcessContext& context, core::ProcessSessionFactory&) {
  auto byte_sequence_str = utils::getRequiredPropertyOrThrow<std::string>(context, ByteSequence.name);
  const auto byte_sequence_format = utils::parseEnumProperty<ByteSequenceFormat>(context, ByteSequenceFormatProperty);
  if (byte_sequence_format == ByteSequenceFormat::Hexadecimal) {
    byte_sequence_ = utils::string::from_hex(byte_sequence_str);
  } else {
    byte_sequence_.resize(byte_sequence_str.size());
    std::ranges::transform(byte_sequence_str, byte_sequence_.begin(), [] (char c) { return static_cast<std::byte>(c); });
  }
  byte_sequence_location_ = utils::parseEnumProperty<ByteSequenceLocation>(context, ByteSequenceLocationProperty);
  keep_byte_sequence = utils::getRequiredPropertyOrThrow<bool>(context, KeepByteSequence.name);
}

std::shared_ptr<core::FlowFile> SplitContent::createNewSplit(core::ProcessSession& session) const {
  auto next_split = session.create();
  if (!next_split) {
    throw Exception(PROCESSOR_EXCEPTION, "Couldn't create FlowFile");
  }
  if (keep_byte_sequence && byte_sequence_location_ == ByteSequenceLocation::Leading) {
    session.appendBuffer(next_split, byte_sequence_);
  }
  return next_split;
}

void SplitContent::finalizeLatestSplitContent(core::ProcessSession& session, const std::shared_ptr<core::FlowFile>& latest_split, const std::vector<std::byte>& buffer) const {
  const std::span<const std::byte> data_without_byte_sequence{buffer.data(), buffer.size() - byte_sequence_.size()};
  session.appendBuffer(latest_split, data_without_byte_sequence);
  if (keep_byte_sequence && byte_sequence_location_ == ByteSequenceLocation::Trailing) {
    session.appendBuffer(latest_split, byte_sequence_);
  }
}

void SplitContent::finalizeLastSplitContent(core::ProcessSession& session,
    std::vector<std::shared_ptr<core::FlowFile>>& splits,
    const std::vector<std::byte>& buffer,
    const bool ended_with_byte_sequence) const {
  if (ended_with_byte_sequence && splits.back()->getSize() != 0) {
    if (keep_byte_sequence && byte_sequence_location_ == ByteSequenceLocation::Leading) {
      const auto last_split = session.create();
      if (!last_split) {
        throw Exception(PROCESSOR_EXCEPTION, "Couldn't create FlowFile");
      }
      splits.push_back(last_split);
      session.appendBuffer(splits.back(), byte_sequence_);
    }
  } else {
    session.appendBuffer(splits.back(), buffer);
  }
}

namespace {
std::shared_ptr<core::FlowFile> createFirstSplit(core::ProcessSession& session) {
  auto first_split = session.create();
  if (!first_split) {
    throw Exception(PROCESSOR_EXCEPTION, "Couldn't create FlowFile");
  }
  return first_split;
}

void updateSplitAttributesAndTransfer(core::ProcessSession& session, const std::vector<std::shared_ptr<core::FlowFile>>& splits, const core::FlowFile& original) {
  const std::string fragment_identifier_ = utils::IdGenerator::getIdGenerator()->generate().to_string();
  for (size_t split_i = 0; split_i < splits.size(); ++split_i) {
    const auto& split = splits[split_i];
    split->setAttribute(SplitContent::FragmentCountOutputAttribute.name, std::to_string(splits.size()));
    split->setAttribute(SplitContent::FragmentIndexOutputAttribute.name, std::to_string(split_i + 1));  // One based indexing
    split->setAttribute(SplitContent::FragmentIdentifierOutputAttribute.name, fragment_identifier_);
    split->setAttribute(SplitContent::SegmentOriginalFilenameOutputAttribute.name, original.getAttribute(core::SpecialFlowAttribute::FILENAME).value_or(""));
    session.transfer(split, SplitContent::Splits);
  }
}

// NOTE: despite its name, this returns true when the last split is NOT empty.
// The caller relies on that result to decide whether a new split has to be started.
bool lastSplitIsEmpty(const std::vector<std::shared_ptr<core::FlowFile>>& splits) {
  return splits.back()->getSize() != 0;
}
}  // namespace

void SplitContent::onTrigger(core::ProcessContext& context, core::ProcessSession& session) {
  gsl_Assert(!byte_sequence_.empty());
  const auto original = session.get();
  if (!original) {
    context.yield();
    return;
  }

  const auto ff_content_stream = session.getFlowFileContentStream(*original);
  if (!ff_content_stream) {
    throw Exception(PROCESSOR_EXCEPTION, fmt::format("Couldn't access the ContentStream of {}", original->getUUID().to_string()));
  }
  std::vector<std::byte> buffer{};
  buffer.reserve(BUFFER_TARGET_SIZE + byte_sequence_.size());
  size_t matching_bytes = 0;
  bool ended_with_byte_sequence = false;
  std::vector<std::shared_ptr<core::FlowFile>> splits{};
  splits.push_back(createFirstSplit(session));

  while (auto latest_byte = ff_content_stream->readByte()) {
    buffer.push_back(*latest_byte);
    if (ended_with_byte_sequence) {
      ended_with_byte_sequence = false;
      if (lastSplitIsEmpty(splits)) {
        splits.push_back(createNewSplit(session));
      }
    }
    if (latest_byte == byte_sequence_[matching_bytes]) {
      matching_bytes++;
      if (matching_bytes == byte_sequence_.size()) {
        // Found the Byte Sequence
        finalizeLatestSplitContent(session, splits.back(), buffer);
        ended_with_byte_sequence = true;
        matching_bytes = 0;
        buffer.clear();
      }
    } else {
      // NOTE: the mismatching byte is not re-checked against the start of the sequence,
      // so a match that overlaps a failed partial match (e.g. "ab" inside "aab") is missed.
      matching_bytes = 0;
      if (buffer.size() >= BUFFER_TARGET_SIZE) {
        session.appendBuffer(splits.back(), buffer);
        buffer.clear();
      }
    }
  }

  finalizeLastSplitContent(session, splits, buffer, ended_with_byte_sequence);

  updateSplitAttributesAndTransfer(session, splits, *original);
  session.transfer(original, Original);
}

REGISTER_RESOURCE(SplitContent, Processor);

} // namespace org::apache::nifi::minifi::processors