-
Notifications
You must be signed in to change notification settings - Fork 6.3k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #249 from jorangreef/master
Add guide: "Working with Different Filesystems"
- Loading branch information
Showing
1 changed file
with
224 additions
and
0 deletions.
There are no files selected for viewing
224 changes: 224 additions & 0 deletions
224
locale/en/docs/guides/working-with-different-filesystems.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,224 @@ | ||
--- | ||
title: Working with Different Filesystems | ||
layout: docs.hbs | ||
--- | ||
|
||
# Working with Different Filesystems | ||
|
||
Node exposes many features of the filesystem. But not all filesystems are alike. | ||
The following are suggested best practices to keep your code simple and safe | ||
when working with different filesystems. | ||
|
||
## Filesystem Behavior | ||
|
||
Before you can work with a filesystem, you need to know how it behaves. | ||
Different filesystems behave differently and have more or less features than | ||
others: case sensitivity, case insensitivity, case preservation, Unicode form | ||
preservation, timestamp resolution, extended attributes, inodes, Unix | ||
permissions, alternate data streams etc. | ||
|
||
Be wary of inferring filesystem behavior from `process.platform`. For example, | ||
do not assume that because your program is running on Darwin that you are | ||
therefore working on a case-insensitive filesystem (HFS+), as the user may be | ||
using a case-sensitive filesystem (HFSX). Similarly, do not assume that because | ||
your program is running on Linux that you are therefore working on a filesystem | ||
which supports Unix permissions and inodes, as you may be on a particular | ||
external drive, USB or network drive which does not. | ||
|
||
The operating system may not make it easy to infer filesystem behavior, but all | ||
is not lost. Instead of keeping a list of every known filesystem and behavior | ||
(which is always going to be incomplete), you can probe the filesystem to see | ||
how it actually behaves. The presence or absence of certain features which are | ||
easy to probe, are often enough to infer the behavior of other features which | ||
are more difficult to probe. | ||
|
||
Remember that some users may have different filesystems mounted at various paths | ||
in the working tree. | ||
|
||
## Avoid a Lowest Common Denominator Approach | ||
|
||
You might be tempted to make your program act like a lowest common denominator | ||
filesystem, by normalizing all filenames to uppercase, normalizing all filenames | ||
to NFC Unicode form, and normalizing all file timestamps to say 1-second | ||
resolution. This would be the lowest common denominator approach. | ||
|
||
Do not do this. You would only be able to interact safely with a filesystem | ||
which has the exact same lowest common denominator characteristics in every | ||
respect. You would be unable to work with more advanced filesystems in the way | ||
that users expect, and you would run into filename or timestamp collisions. You | ||
would most certainly lose and corrupt user data through a series of complicated | ||
dependent events, and you would create bugs that would be difficult if not | ||
impossible to solve. | ||
|
||
What happens when you later need to support a filesystem that only has 2-second | ||
or 24-hour timestamp resolution? What happens when the Unicode standard advances | ||
to include a slightly different normalization algorithm (as has happened in the | ||
past)? | ||
|
||
A lowest common denominator approach would tend to try to create a portable | ||
program by using only "portable" system calls. This leads to programs that are | ||
leaky and not in fact portable. | ||
|
||
## Adopt a Superset Approach | ||
|
||
Make the best use of each platform you support by adopting a superset approach. | ||
For example, a portable backup program should sync btimes (the created time of a | ||
file or folder) correctly between Windows systems, and should not destroy or | ||
alter btimes, even though btimes are not supported on Linux systems. The same | ||
portable backup program should sync Unix permissions correctly between Linux | ||
systems, and should not destroy or alter Unix permissions, even though Unix | ||
permissions are not supported on Windows systems. | ||
|
||
Handle different filesystems by making your program act like a more advanced | ||
filesystem. Support a superset of all possible features: case-sensitivity, | ||
case-preservation, Unicode form sensitivity, Unicode form preservation, Unix | ||
permissions, high-resolution nanosecond timestamps, extended attributes etc. | ||
|
||
Once you have case-preservation in your program, you can always implement | ||
case-insensitivity if you need to interact with a case-insensitive filesystem. | ||
But if you forego case-preservation in your program, you cannot interact safely | ||
with a case-preserving filesystem. The same is true for Unicode form | ||
preservation and timestamp resolution preservation. | ||
|
||
If a filesystem provides you with a filename in a mix of lowercase and | ||
uppercase, then keep the filename in the exact case given. If a filesystem | ||
provides you with a filename in mixed Unicode form or NFC or NFD (or NFKC or | ||
NFKD), then keep the filename in the exact byte sequence given. If a filesystem | ||
provides you with a millisecond timestamp, then keep the timestamp in | ||
millisecond resolution. | ||
|
||
When you work with a lesser filesystem, you can always downsample appropriately, | ||
with comparison functions as required by the behavior of the filesystem on which | ||
your program is running. If you know that the filesystem does not support Unix | ||
permissions, then you should not expect to read the same Unix permissions you | ||
write. If you know that the filesystem does not preserve case, then you should | ||
be prepared to see `ABC` in a directory listing when your program creates `abc`. | ||
But if you know that the filesystem does preserve case, then you should consider | ||
`ABC` to be a different filename to `abc`, when detecting file renames or if the | ||
filesystem is case-sensitive. | ||
|
||
## Case Preservation | ||
|
||
You may create a directory called `test/abc` and be surprised to see sometimes | ||
that `fs.readdir('test')` returns `['ABC']`. This is not a bug in Node. Node | ||
returns the filename as the filesystem stores it, and not all filesystems | ||
support case-preservation. Some filesystems convert all filenames to uppercase | ||
(or lowercase). | ||
|
||
## Unicode Form Preservation | ||
|
||
*Case preservation and Unicode form preservation are similar concepts. To | ||
understand why Unicode form should be preserved , make sure that you first | ||
understand why case should be preserved. Unicode form preservation is just as | ||
simple when understood correctly.* | ||
|
||
Unicode can encode the same characters using several different byte sequences. | ||
Several strings may look the same, but have different byte sequences. When | ||
working with UTF-8 strings, be careful that your expectations are in line with | ||
how Unicode works. Just as you would not expect all UTF-8 characters to encode | ||
to a single byte, you should not expect several UTF-8 strings that look the same | ||
to the human eye to have the same byte representation. This may be an | ||
expectation that you can have of ASCII, but not of UTF-8. | ||
|
||
You may create a directory called `test/café` (NFC Unicode form with byte | ||
sequence `<63 61 66 c3 a9>` and `string.length === 5`) and be surprised to see | ||
sometimes that `fs.readdir('test')` returns `['café']` (NFD Unicode form with | ||
byte sequence `<63 61 66 65 cc 81>` and `string.length === 6`). This is not a | ||
bug in Node. Node returns the filename as the filesystem stores it, and not all | ||
filesystems support Unicode form preservation. | ||
|
||
HFS+, for example, will normalize all filenames to a form almost always the same | ||
as NFD form. Do not expect HFS+ to behave the same as NTFS or EXT4 and | ||
vice-versa. Do not try to change data permanently through normalization as a | ||
leaky abstraction to paper over Unicode differences between filesystems. This | ||
would create problems without solving any. Rather, preserve Unicode form and use | ||
normalization as a comparison function only. | ||
|
||
## Unicode Form Insensitivity | ||
|
||
Unicode form insensitivity and Unicode form preservation are two different | ||
filesystem behaviors often mistaken for each other. Just as case-insensitivity | ||
has sometimes been incorrectly implemented by permanently normalizing filenames | ||
to uppercase when storing and transmitting filenames, so Unicode form | ||
insensitivity has sometimes been incorrectly implemented by permanently | ||
normalizing filenames to a certain Unicode form (NFD in the case of HFS+) when | ||
storing and transmitting filenames. It is possible and much better to implement | ||
Unicode form insensitivity without sacrificing Unicode form preservation, by | ||
using Unicode normalization for comparison only. | ||
|
||
## Comparing Different Unicode Forms | ||
|
||
Node provides `string.normalize('NFC' / 'NFD')` which you can use to normalize a | ||
UTF-8 string to either NFC or NFD. You should never store the output from this | ||
function but only use it as part of a comparison function to test whether two | ||
UTF-8 strings would look the same to the user. | ||
|
||
You can use `string1.normalize('NFC') === string2.normalize('NFC')` or | ||
`string1.normalize('NFD') === string2.normalize('NFD')` as your comparison | ||
function. Which form you use does not matter. | ||
|
||
Normalization is fast but you may want to use a cache as input to your | ||
comparison function to avoid normalizing the same string many times over. If the | ||
string is not present in the cache then normalize it and cache it. Be careful | ||
not to store or persist the cache, use it only as a cache. | ||
|
||
Note that using `normalize()` requires that your version of Node include ICU | ||
(otherwise `normalize()` will just return the original string). If you download | ||
the latest version of Node from the website then it will include ICU. | ||
|
||
## Timestamp Resolution | ||
|
||
You may set the `mtime` (the modified time) of a file to `1444291759414` | ||
(millisecond resolution) and be surprised to see sometimes that `fs.stat` | ||
returns the new mtime as `1444291759000` (1-second resolution) or | ||
`1444291758000` (2-second resolution). This is not a bug in Node. Node returns | ||
the timestamp as the filesystem stores it, and not all filesystems support | ||
nanosecond, millisecond or 1-second timestamp resolution. Some filesystems even | ||
have very coarse resolution for the atime timestamp in particular, e.g. 24 hours | ||
for some FAT filesystems. | ||
|
||
## Do Not Corrupt Filenames and Timestamps Through Normalization | ||
|
||
Filenames and timestamps are user data. Just as you would never automatically | ||
rewrite user file data to uppercase the data or normalize `CRLF` to `LF` | ||
line-endings, so you should never change, interfere or corrupt filenames or | ||
timestamps through case / Unicode form / timestamp normalization. Normalization | ||
should only ever be used for comparison, never for altering data. | ||
|
||
Normalization is effectively a lossy hash code. You can use it to test for | ||
certain kinds of equivalence (e.g. do several strings look the same even though | ||
they have different byte sequences) but you can never use it as a substitute for | ||
the actual data. Your program should pass on filename and timestamp data as is. | ||
|
||
Your program can create new data in NFC (or in any combination of Unicode form | ||
it prefers) or with a lowercase or uppercase filename, or with a 2-second | ||
resolution timestamp, but your program should not corrupt existing user data by | ||
imposing case / Unicode form / timestamp normalization. Rather, adopt a superset | ||
approach and preserve case, Unicode form and timestamp resolution in your | ||
program. That way, you will be able to interact safely with filesystems which do | ||
the same. | ||
|
||
## Use Normalization Comparison Functions Appropriately | ||
|
||
Make sure that you use case / Unicode form / timestamp comparison functions | ||
appropriately. Do not use a case-insensitive filename comparison function if you | ||
are working on a case-sensitive filesystem. Do not use a Unicode form | ||
insensitive comparison function if you are working on a Unicode form sensitive | ||
filesystem (e.g. NTFS and most Linux filesystems which preserve both NFC and NFD | ||
or mixed Unicode forms). Do not compare timestamps at 2-second resolution if you | ||
are working on a nanosecond timestamp resolution filesystem. | ||
|
||
## Be Prepared for Slight Differences in Comparison Functions | ||
|
||
Be careful that your comparison functions match those of the filesystem (or | ||
probe the filesystem if possible to see how it would actually compare). | ||
Case-insensitivity for example is more complex than a simple `toLowerCase()` | ||
comparison. In fact, `toUpperCase()` is usually better than `toLowerCase()` | ||
(since it handles certain foreign language characters differently). But better | ||
still would be to probe the filesystem since every filesystem has its own case | ||
comparison table baked in. | ||
|
||
As an example, Apple's HFS+ normalizes filenames to NFD form but this NFD form | ||
is actually an older version of the current NFD form and may sometimes be | ||
slightly different from the latest Unicode standard's NFD form. Do not expect | ||
HFS+ NFD to be exactly the same as Unicode NFD all the time. |