Split a single Message Stream file into multiple files. Ideal for chunking a stream into smaller pieces to keep file sizes manageable or to break database uploads into separate runs, or for "grouping" lines into files based on properties or values.
This is a gulp-etl plugin, and as such it is a gulp plugin. gulp-etl plugins process ndjson data streams/files, which we call Message Streams and which are compliant with the Singer specification. Message Streams look like this:
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}
{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}}
{"type": "SCHEMA", "stream": "locations", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "locations", "record": {"id": 1, "name": "Philadelphia"}}
{"type": "STATE", "value": {"users": 2, "locations": 1}}
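Since a Message Stream is plain ndjson, each line can be read as an independent JSON object. The following is a minimal sketch of that idea, not code from this plugin; `StreamMessage` and `parseMessageStream` are hypothetical names introduced only for illustration.

```typescript
// Assumed minimal shape of one Message Stream line: every message has a `type`,
// and SCHEMA/RECORD messages also carry a `stream` name.
type StreamMessage = { type: string; stream?: string; [key: string]: unknown };

// Hypothetical helper: parse ndjson text into messages, skipping blank lines.
function parseMessageStream(ndjson: string): StreamMessage[] {
  return ndjson
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as StreamMessage);
}

const sampleStream = [
  '{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}',
  "",
  '{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}}',
  '{"type": "STATE", "value": {"users": 2, "locations": 1}}',
].join("\n");

const streamMessages = parseMessageStream(sampleStream);
// streamMessages.length is 3; the blank line is skipped
```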
const splitFile = require('gulp-etl-splitfile').splitFile; // javascript
import { splitFile } from 'gulp-etl-splitfile'; // typescript
gulp-etl plugins accept a configObj as their first parameter. The configObj contains any info the plugin needs.
Available configObj properties for this plugin:
index:number
- The maximum number of lines in each new file. Cannot be combined with `groupBy`.
.pipe(splitFile({index:1000})) // Split out a new file every 1000 lines
.pipe(splitFile({groupBy:'.type', index:2 })) // cause error by using groupBy and index together
.pipe(splitFile({})) // default (no options): split out a new file for every line
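To illustrate what splitting by `index` means, here is a rough sketch of the chunking logic on an in-memory array of lines. It is not the plugin's implementation; `chunkLines` is a hypothetical helper, and its default of 1 simply mirrors the "new file for every line" default described above.

```typescript
// Hypothetical helper: split lines into chunks of at most `index` lines each.
// Assumption: a missing `index` behaves like 1 (one line per chunk).
function chunkLines(lines: string[], index: number = 1): string[][] {
  if (Math.floor(index) !== index || index < 1) {
    throw new Error("index must be a positive integer");
  }
  const chunks: string[][] = [];
  for (let i = 0; i < lines.length; i += index) {
    chunks.push(lines.slice(i, i + index));
  }
  return chunks;
}

const fiveLines = ["a", "b", "c", "d", "e"];
const byTwo = chunkLines(fiveLines, 2); // [["a","b"], ["c","d"], ["e"]]
const byOne = chunkLines(fiveLines);    // five chunks of one line each
```

Each chunk would correspond to one output file; the last chunk may hold fewer than `index` lines.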
groupBy:string|array
- Value(s) in each line that determine which file the line goes to; selected using JSONSelect. Cannot be combined with `index`.
.pipe(splitFile({groupBy:'.type'})) // group by (split lines to new files based on) the value of the "type" property of each line
.pipe(splitFile({groupBy:['.type', ".stream"]})) // group by `type` and then `stream`
.pipe(splitFile({groupBy:'.record .name'})) // group by the `record.name` property
.pipe(splitFile({groupBy:'.record ."Last Name", .type:val("STATE")' })) // group by `record.Last Name`, and/or by `type` (if it is equal to "STATE")
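The grouping idea can be sketched as follows. This is a simplified illustration, not the plugin's code: it uses plain dot paths such as `record.name` instead of JSONSelect selectors, and `getPath` and `groupLines` are hypothetical helpers. How the plugin treats lines with no matching value is not stated here; this sketch simply leaves missing values out of the group key.

```typescript
type JsonObject = { [key: string]: unknown };

// Hypothetical helper: follow a dot path like "record.name" (a stand-in for JSONSelect).
function getPath(obj: unknown, path: string): unknown {
  let current: unknown = obj;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as JsonObject)[key];
  }
  return current;
}

// Hypothetical helper: bucket lines by the values found at each path.
// Assumption: values that are missing are skipped when building the group key.
function groupLines(lines: JsonObject[], paths: string[], separator: string = "_"): { [group: string]: JsonObject[] } {
  const groups: { [group: string]: JsonObject[] } = Object.create(null);
  for (const line of lines) {
    const values: string[] = [];
    for (const path of paths) {
      const value = getPath(line, path);
      if (value !== undefined) values.push(String(value));
    }
    const key = values.join(separator);
    if (!groups[key]) groups[key] = [];
    groups[key].push(line);
  }
  return groups;
}

const groupInput: JsonObject[] = [
  { type: "SCHEMA", stream: "users" },
  { type: "RECORD", stream: "users", record: { id: 1, name: "Chris" } },
  { type: "RECORD", stream: "users", record: { id: 2, name: "Mike" } },
  { type: "STATE", value: { users: 2 } },
];

const grouped = groupLines(groupInput, ["type", "stream"]);
// Object.keys(grouped) -> ["SCHEMA_users", "RECORD_users", "STATE"]
```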
separator:string
- Character(s) to separate sections of file names
// splitting `file.ndjson`
.pipe(splitFile({index:1000, separator:'_'})) // -> `file_0.ndjson`, `file_1.ndjson`... (this is the default)
.pipe(splitFile({index:1000, separator:'-'})) // -> `file-0.ndjson`, `file-1.ndjson`...
.pipe(splitFile({groupBy:'.type', separator:'-'})) // -> `file-SCHEMA.ndjson`, `file-RECORD.ndjson`...
.pipe(splitFile({groupBy:['.type', ".stream"]})) // -> `file_SCHEMA_users.ndjson`, `file_RECORD_users.ndjson`...
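The naming pattern in these examples can be reproduced with a small function. This is only a sketch inferred from the file names shown above; `buildSplitName` is a hypothetical helper, not part of the plugin's API.

```typescript
// Hypothetical helper: place name sections between the base name and the extension,
// joined by `separator` (assumed default "_", as in the examples above).
function buildSplitName(fileName: string, sections: Array<string | number>, separator: string = "_"): string {
  const dot = fileName.lastIndexOf(".");
  const base = dot > 0 ? fileName.slice(0, dot) : fileName;
  const extension = dot > 0 ? fileName.slice(dot) : "";
  const parts = [base].concat(sections.map((section) => String(section)));
  return parts.join(separator) + extension;
}

const indexedName = buildSplitName("file.ndjson", [0]);                      // "file_0.ndjson"
const dashedName = buildSplitName("file.ndjson", ["SCHEMA"], "-");           // "file-SCHEMA.ndjson"
const groupedName = buildSplitName("file.ndjson", ["SCHEMA", "users"]);      // "file_SCHEMA_users.ndjson"
```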
timeStamp:boolean
- Add a shortened string based on the current time to all file names. Use it to keep successive runs from overwriting the results of earlier ones
.pipe(splitFile({index:1000, timeStamp:true })) // -> `file_l4514_fe_0.ndjson`, `file_l4514_fe_1.ndjson`...
- Dependencies:
  - Clone this repo and run `npm install` to install npm packages
- Debug: with VS Code, use `Open Folder` to open the project folder, then hit F5 to debug. This runs without compiling to JavaScript, using ts-node
- Test: `npm test` or `npm t`
- Compile to JavaScript: `npm run build`
- Run using included test data (be sure to build first): `gulp --gulpfile debug/gulpfile.ts`
We are using Jest for our testing. All of our tests are in the `test` folder.
- Run `npm test` to run the test suites (note: tests are currently broken)
Note: This document is written in Markdown. We like to use Typora and Markdown Preview Plus for our Markdown work.