Skip to content

Commit

Permalink
Allow CharsetConverter to handle BOM skipping #483
Browse files Browse the repository at this point in the history
  • Loading branch information
nyamsprod committed Feb 7, 2023
1 parent 4b5d1df commit c8451b7
Show file tree
Hide file tree
Showing 24 changed files with 204 additions and 24 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,10 @@ All Notable changes to `Csv` will be documented in this file

### Added

- `League\Csv\TabularDataWriter` interface to represent how to write to a tabular data document.
- `TabularDataWriter` interface to represent how to write to a tabular data document.
- `TabularDataReader::first` to replace `TabularDataReader::fetchOne`
- `TabularDataReader::nth` to replace `TabularDataReader::fetchOne`
- `CharsetConverter::skipBOM` to improve BOM skipping see [bug #483](https://github.com/thephpleague/csv/issues/483)

### Deprecated

Expand Down
10 changes: 5 additions & 5 deletions composer.json
Original file line number Diff line number Diff line change
Expand Up @@ -33,13 +33,13 @@
"require-dev": {
"ext-dom": "*",
"doctrine/collections": "^2.1.2",
"friendsofphp/php-cs-fixer": "^v3.13.2",
"phpbench/phpbench": "^1.2.7",
"phpstan/phpstan": "^1.9.6",
"friendsofphp/php-cs-fixer": "^v3.14.3",
"phpbench/phpbench": "^1.2.8",
"phpstan/phpstan": "^1.9.16",
"phpstan/phpstan-deprecation-rules": "^1.1.1",
"phpstan/phpstan-phpunit": "^1.3.3",
"phpstan/phpstan-strict-rules": "^1.4.4",
"phpunit/phpunit": "^9.5.27"
"phpstan/phpstan-strict-rules": "^1.4.5",
"phpunit/phpunit": "^9.6.3"
},
"autoload": {
"psr-4": {
Expand Down
10 changes: 7 additions & 3 deletions docs/9.0/connections/bom.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,12 +62,12 @@ bom_match('hello world!'.ByteSequence::BOM_UTF16_BE); //returns ''
public AbstractCsv::getInputBOM(void): string
```

The CSV document current BOM character is detected using the `getInputBOM` method. This method returns the currently used BOM character or an empty string if none is found or recognized. The detection is done using the `bom_match` function.
The CSV document current BOM character is detected using the `getInputBOM` method. This method returns the currently used BOM character or an empty string if none is found or recognized. The detection is done using the `Info::fetchBOMSequence` static method.

```php
use League\Csv\Writer;
use League\Csv\Reader;

$csv = Writer::createFromPath('/path/to/file.csv');
$csv = Reader::createFromPath('/path/to/file.csv');
$bom = $csv->getInputBOM();
```

Expand Down Expand Up @@ -128,3 +128,7 @@ The returned `$document` will contain **2** BOM markers instead of one.

<p class="message-warning">If you are using a <code>stream</code> that can not be seekable you should disable BOM skipping, otherwise an <code>Exception</code> will be triggered.</p>
<p class="message-warning">The BOM sequence is never removed from the CSV document, it is only skipped from the result set.</p>

### Skipping the BOM Sequence

<p class="message-info">Since version <code>9.9.0</code> you can skip the BOM sequence using the <a href="/9.0/interoperability/encoding/">CharsetConverter</a> filter</p>
5 changes: 3 additions & 2 deletions docs/9.0/connections/filters.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ title: Controlling PHP Stream Filters

# Stream Filters

To ease performing operations on the CSV document as it is being read from or written to, you can add PHP stream filters to the `Reader` and `Writer` objects.
To ease performing operations on the CSV document as it is being read from or written to, the `Reader` and `Writer` objects allow attaching PHP stream filters to them.

## Detecting stream filter support

Expand Down Expand Up @@ -143,7 +143,8 @@ $reader = null;

## Bundled stream filters

The library comes bundled with two (2) stream filters:
The library comes bundled with the following stream filters:

- [RFC4180Field](/9.0/interoperability/rfc4180-field/) stream filter to read or write RFC4180 compliant CSV field;
- [CharsetConverter](/9.0/converter/charset/) stream filter to convert your CSV document content using the `mbstring` extension;
- [SkipBOMSequence](/9.0/connections/bom/) stream filter to skip your CSV document BOM sequence if present;
30 changes: 30 additions & 0 deletions docs/9.0/interoperability/encoding.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,3 +55,33 @@ $writer->output('mycsvfile.csv');
```

<p class="message-info">The conversion is done with the <code>mbstring</code> extension using the <a href="/9.0/converter/charset/">League\Csv\CharsetConverter</a>.</p>

## Skipping The BOM sequence with the Reader class

<p class="message-info">Since version <code>9.9.0</code>.</p>

In order to ensure the correct removal of the sequence and avoid bugs while parsing the CSV, the filter can skip the
BOM sequence completely when using the `Reader` class and convert the CSV content from the BOM sequence encoding charset
to UTF-8. To work as intended call the `Reader::includeInputBOM` method to ensure the default BOM removal behaviour is disabled
and add the stream filter to you reader instance using the static method `CharsetConverter::skipBOM` method;

```php
<?php

use League\Csv\Reader;
use League\Csv\CharsetConverter;

$input = Reader::BOM_UTF16_BE."john,doe,john.doe@example.com\njane,doe,jane.doe@example.com\n";
$document = Reader::createFromString($input);
$document->includeInputBOM(); // de-activate the default skipping mechanism
CharsetConverter::allowBOMSkipping($document);
var_dump(iterator_to_array($document));
// returns the document content without the skipped BOM sequence
// [
// ['john', 'doe', 'john.doe@example.com'],
// ['jane', 'doe', 'jane.doe@example.com'],
// ]
```

<p class="message-warning">Once the filter is applied, the <code>Reader</code> instance looses all information regarding its
own BOM sequence. <strong>The sequence is still be present but the instance is no longer able to detect it</strong>.</p>
5 changes: 3 additions & 2 deletions src/AbstractCsv.php
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@

use Generator;
use SplFileObject;
use Stringable;
use function filter_var;
use function get_class;
use function mb_strlen;
Expand Down Expand Up @@ -94,9 +95,9 @@ public static function createFromStream($stream): static
/**
* Returns a new instance from a string.
*/
public static function createFromString(string $content = ''): static
public static function createFromString(Stringable|string $content = ''): static
{
return new static(Stream::createFromString($content));
return new static(Stream::createFromString((string) $content));
}

/**
Expand Down
2 changes: 2 additions & 0 deletions src/AbstractCsvTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use PHPUnit\Framework\TestCase;
Expand Down
2 changes: 2 additions & 0 deletions src/CannotInsertRecordTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use PHPUnit\Framework\TestCase;
Expand Down
44 changes: 36 additions & 8 deletions src/CharsetConverter.php
Original file line number Diff line number Diff line change
Expand Up @@ -36,18 +36,42 @@
class CharsetConverter extends php_user_filter
{
public const FILTERNAME = 'convert.league.csv';
public const BOM_SEQUENCE = 'bom_sequence';
public const SKIP_BOM_SEQUENCE = 'skip_bom_sequence';

protected string $input_encoding = 'UTF-8';
protected string $output_encoding = 'UTF-8';
protected bool $skipBomSequence = false;

/**
* Static method to add the stream filter to a {@link Reader} object to handle BOM skipping.
*/
public static function allowBOMSkipping(Reader $document): Reader
{
self::register();

$document->addStreamFilter(
self::getFiltername(match ($document->getInputBOM()) {
ByteSequence::BOM_UTF16_LE => 'UTF-16LE',
ByteSequence::BOM_UTF16_BE => 'UTF-16BE',
ByteSequence::BOM_UTF32_LE => 'UTF-32LE',
ByteSequence::BOM_UTF32_BE => 'UTF-32BE',
default => 'UTF-8',
}, 'UTF-8'),
[self::BOM_SEQUENCE => self::SKIP_BOM_SEQUENCE]
);

return $document;
}

/**
* Static method to add the stream filter to a {@link AbstractCsv} object.
*/
public static function addTo(AbstractCsv $csv, string $input_encoding, string $output_encoding): AbstractCsv
public static function addTo(AbstractCsv $csv, string $input_encoding, string $output_encoding, array $params = null): AbstractCsv
{
self::register();

return $csv->addStreamFilter(self::getFiltername($input_encoding, $output_encoding));
return $csv->addStreamFilter(self::getFiltername($input_encoding, $output_encoding), $params);
}

/**
Expand Down Expand Up @@ -110,23 +134,27 @@ public function onCreate(): bool
try {
$this->input_encoding = self::filterEncoding($matches['input']);
$this->output_encoding = self::filterEncoding($matches['output']);
$this->skipBomSequence = is_array($this->params)
&& isset($this->params[self::BOM_SEQUENCE])
&& self::SKIP_BOM_SEQUENCE === $this->params[self::BOM_SEQUENCE];
} catch (OutOfRangeException) {
return false;
}

return true;
}

/**
* @param resource $in
* @param resource $out
* @param int $consumed
*/
public function filter($in, $out, &$consumed, bool $closing): int
{
set_error_handler(fn (int $errno, string $errstr, string $errfile, int $errline) => true);
$alreadyRun = false;
while (null !== ($bucket = stream_bucket_make_writeable($in))) {
$bucket->data = mb_convert_encoding($bucket->data, $this->output_encoding, $this->input_encoding);
$content = $bucket->data;
if (!$alreadyRun && $this->skipBomSequence && null !== ($bom = Info::fetchBOMSequence($content))) {
$content = substr($content, strlen($bom));
}
$alreadyRun = true;
$bucket->data = mb_convert_encoding($content, $this->output_encoding, $this->input_encoding);
$consumed += $bucket->datalen;
stream_bucket_append($out, $bucket);
}
Expand Down
81 changes: 81 additions & 0 deletions src/CharsetConverterTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use ArrayIterator;
Expand Down Expand Up @@ -196,4 +198,83 @@ public function converterProvider(): array
],
];
}

public function testItDoesNotChangeTheCSVContentIfNoBOMSequenceIsFound(): void
{
$data = <<<CSV
"start
end"
CSV;
$reader = Reader::createFromString($data);
CharsetConverter::allowBOMSkipping($reader);
$reader->includeInputBOM();

self::assertSame(
[['start
end']],
iterator_to_array($reader)
);
}

/**
* @dataProvider providesBOMSequences
*/
public function testItSkipBOMSequenceBeforeConsumingTheCSVStream(string $sequence): void
{
$data = <<<CSV
"start
end"
CSV;
$reader = Reader::createFromString($sequence.$data);
$reader->includeInputBOM();
CharsetConverter::allowBOMSkipping($reader);

self::assertSame(
[['start
end']],
iterator_to_array($reader)
);
}

/**
* @dataProvider providesBOMSequences
*/
public function testItOnlySkipOnceTheBOMSequenceBeforeConsumingTheCSVStreamOnMultipleLine(string $sequence): void
{
$data = <<<CSV
"{$sequence}start
end"
CSV;
$reader = Reader::createFromString($sequence.$data);
$reader->includeInputBOM();
CharsetConverter::allowBOMSkipping($reader);

self::assertSame(
[[$sequence.'start
end']],
iterator_to_array($reader)
);
}

/**
* @dataProvider providesBOMSequences
*/
public function testItOnlySkipOnceTheBOMSequenceBeforeConsumingTheCSVStreamOnSingleLine(string $sequence): void
{
$reader = Reader::createFromString($sequence.$sequence.'start,'.$sequence.'end');
CharsetConverter::allowBOMSkipping($reader);
$reader->includeInputBOM();

self::assertSame(
[[$sequence.'start', $sequence.'end']],
iterator_to_array($reader)
);
}

public function providesBOMSequences(): iterable
{
yield 'BOM UTF-8' => [
'sequence' => ByteSequence::BOM_UTF8,
];
}
}
2 changes: 2 additions & 0 deletions src/ColumnConsistencyTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use PHPUnit\Framework\TestCase;
Expand Down
2 changes: 2 additions & 0 deletions src/EncloseFieldTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use InvalidArgumentException;
Expand Down
2 changes: 2 additions & 0 deletions src/EscapeFormulaTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use InvalidArgumentException;
Expand Down
2 changes: 2 additions & 0 deletions src/HTMLConverterTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use DOMException;
Expand Down
2 changes: 2 additions & 0 deletions src/InfoTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use PHPUnit\Framework\TestCase;
Expand Down
2 changes: 2 additions & 0 deletions src/RFC4180FieldTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use InvalidArgumentException;
Expand Down
7 changes: 6 additions & 1 deletion src/Reader.php
Original file line number Diff line number Diff line change
Expand Up @@ -98,7 +98,12 @@ protected function setHeader(int $offset): array
return $header;
}

$header = $this->removeBOM($header, mb_strlen($this->getInputBOM()), $this->enclosure);
$header = $this->removeBOM(
$header,
!$this->is_input_bom_included ? mb_strlen($this->getInputBOM()) : 0,
$this->enclosure
);

if ([''] === $header) {
throw SyntaxError::dueToHeaderNotFound($offset);
}
Expand Down
2 changes: 2 additions & 0 deletions src/ReaderTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@
* file that was distributed with this source code.
*/

declare(strict_types=1);

namespace League\Csv;

use PHPUnit\Framework\TestCase;
Expand Down
Loading

0 comments on commit c8451b7

Please sign in to comment.