Merge pull request #249 from jorangreef/master

Add guide: "Working with Different Filesystems"
nodejs · Oct 20, 2015 · 64ad664 · 64ad664
2 parents 886bc35 + 4aea269
commit 64ad664
Showing 1 changed file with 224 additions and 0 deletions.
diff --git a/locale/en/docs/guides/working-with-different-filesystems.md b/locale/en/docs/guides/working-with-different-filesystems.md
@@ -0,0 +1,224 @@
+---
+title: Working with Different Filesystems
+layout: docs.hbs
+---
+
+# Working with Different Filesystems
+
+Node exposes many features of the filesystem. But not all filesystems are alike.
+The following are suggested best practices to keep your code simple and safe
+when working with different filesystems.
+
+## Filesystem Behavior
+
+Before you can work with a filesystem, you need to know how it behaves.
+Different filesystems behave differently and have more or less features than
+others: case sensitivity, case insensitivity, case preservation, Unicode form
+preservation, timestamp resolution, extended attributes, inodes, Unix
+permissions, alternate data streams etc.
+
+Be wary of inferring filesystem behavior from `process.platform`. For example,
+do not assume that because your program is running on Darwin that you are
+therefore working on a case-insensitive filesystem (HFS+), as the user may be
+using a case-sensitive filesystem (HFSX). Similarly, do not assume that because
+your program is running on Linux that you are therefore working on a filesystem
+which supports Unix permissions and inodes, as you may be on a particular
+external drive, USB or network drive which does not.
+
+The operating system may not make it easy to infer filesystem behavior, but all
+is not lost. Instead of keeping a list of every known filesystem and behavior
+(which is always going to be incomplete), you can probe the filesystem to see
+how it actually behaves. The presence or absence of certain features which are
+easy to probe, are often enough to infer the behavior of other features which
+are more difficult to probe.
+
+Remember that some users may have different filesystems mounted at various paths
+in the working tree.
+
+## Avoid a Lowest Common Denominator Approach
+
+You might be tempted to make your program act like a lowest common denominator
+filesystem, by normalizing all filenames to uppercase, normalizing all filenames
+to NFC Unicode form, and normalizing all file timestamps to say 1-second
+resolution. This would be the lowest common denominator approach.
+
+Do not do this. You would only be able to interact safely with a filesystem
+which has the exact same lowest common denominator characteristics in every
+respect. You would be unable to work with more advanced filesystems in the way
+that users expect, and you would run into filename or timestamp collisions. You
+would most certainly lose and corrupt user data through a series of complicated
+dependent events, and you would create bugs that would be difficult if not
+impossible to solve.
+
+What happens when you later need to support a filesystem that only has 2-second
+or 24-hour timestamp resolution? What happens when the Unicode standard advances
+to include a slightly different normalization algorithm (as has happened in the
+past)?
+
+A lowest common denominator approach would tend to try to create a portable
+program by using only "portable" system calls. This leads to programs that are
+leaky and not in fact portable.
+
+## Adopt a Superset Approach
+
+Make the best use of each platform you support by adopting a superset approach.
+For example, a portable backup program should sync btimes (the created time of a
+file or folder) correctly between Windows systems, and should not destroy or
+alter btimes, even though btimes are not supported on Linux systems. The same
+portable backup program should sync Unix permissions correctly between Linux
+systems, and should not destroy or alter Unix permissions, even though Unix
+permissions are not supported on Windows systems.
+
+Handle different filesystems by making your program act like a more advanced
+filesystem. Support a superset of all possible features: case-sensitivity,
+case-preservation, Unicode form sensitivity, Unicode form preservation, Unix
+permissions, high-resolution nanosecond timestamps, extended attributes etc.
+
+Once you have case-preservation in your program, you can always implement
+case-insensitivity if you need to interact with a case-insensitive filesystem.
+But if you forego case-preservation in your program, you cannot interact safely
+with a case-preserving filesystem. The same is true for Unicode form
+preservation and timestamp resolution preservation.
+
+If a filesystem provides you with a filename in a mix of lowercase and
+uppercase, then keep the filename in the exact case given. If a filesystem
+provides you with a filename in mixed Unicode form or NFC or NFD (or NFKC or
+NFKD), then keep the filename in the exact byte sequence given. If a filesystem
+provides you with a millisecond timestamp, then keep the timestamp in
+millisecond resolution.
+
+When you work with a lesser filesystem, you can always downsample appropriately,
+with comparison functions as required by the behavior of the filesystem on which
+your program is running. If you know that the filesystem does not support Unix
+permissions, then you should not expect to read the same Unix permissions you
+write. If you know that the filesystem does not preserve case, then you should
+be prepared to see `ABC` in a directory listing when your program creates `abc`.
+But if you know that the filesystem does preserve case, then you should consider
+`ABC` to be a different filename to `abc`, when detecting file renames or if the
+filesystem is case-sensitive.
+
+## Case Preservation
+
+You may create a directory called `test/abc` and be surprised to see sometimes
+that `fs.readdir('test')` returns `['ABC']`. This is not a bug in Node. Node
+returns the filename as the filesystem stores it, and not all filesystems
+support case-preservation. Some filesystems convert all filenames to uppercase
+(or lowercase).
+
+## Unicode Form Preservation
+
+*Case preservation and Unicode form preservation are similar concepts. To
+understand why Unicode form should be preserved , make sure that you first
+understand why case should be preserved. Unicode form preservation is just as
+simple when understood correctly.*
+
+Unicode can encode the same characters using several different byte sequences.
+Several strings may look the same, but have different byte sequences. When
+working with UTF-8 strings, be careful that your expectations are in line with
+how Unicode works. Just as you would not expect all UTF-8 characters to encode
+to a single byte, you should not expect several UTF-8 strings that look the same
+to the human eye to have the same byte representation. This may be an
+expectation that you can have of ASCII, but not of UTF-8.
+
+You may create a directory called `test/café` (NFC Unicode form with byte
+sequence `<63 61 66 c3 a9>` and `string.length === 5`) and be surprised to see
+sometimes that `fs.readdir('test')` returns `['café']` (NFD Unicode form with
+byte sequence `<63 61 66 65 cc 81>` and `string.length === 6`). This is not a
+bug in Node. Node returns the filename as the filesystem stores it, and not all
+filesystems support Unicode form preservation.
+
+HFS+, for example, will normalize all filenames to a form almost always the same
+as NFD form. Do not expect HFS+ to behave the same as NTFS or EXT4 and
+vice-versa. Do not try to change data permanently through normalization as a
+leaky abstraction to paper over Unicode differences between filesystems. This
+would create problems without solving any. Rather, preserve Unicode form and use
+normalization as a comparison function only.
+
+## Unicode Form Insensitivity
+
+Unicode form insensitivity and Unicode form preservation are two different
+filesystem behaviors often mistaken for each other. Just as case-insensitivity
+has sometimes been incorrectly implemented by permanently normalizing filenames
+to uppercase when storing and transmitting filenames, so Unicode form
+insensitivity has sometimes been incorrectly implemented by permanently
+normalizing filenames to a certain Unicode form (NFD in the case of HFS+) when
+storing and transmitting filenames. It is possible and much better to implement
+Unicode form insensitivity without sacrificing Unicode form preservation, by
+using Unicode normalization for comparison only.
+
+## Comparing Different Unicode Forms
+
+Node provides `string.normalize('NFC' / 'NFD')` which you can use to normalize a
+UTF-8 string to either NFC or NFD. You should never store the output from this
+function but only use it as part of a comparison function to test whether two
+UTF-8 strings would look the same to the user.
+
+You can use `string1.normalize('NFC') === string2.normalize('NFC')` or
+`string1.normalize('NFD') === string2.normalize('NFD')` as your comparison
+function. Which form you use does not matter.
+
+Normalization is fast but you may want to use a cache as input to your
+comparison function to avoid normalizing the same string many times over. If the
+string is not present in the cache then normalize it and cache it. Be careful
+not to store or persist the cache, use it only as a cache.
+
+Note that using `normalize()` requires that your version of Node include ICU
+(otherwise `normalize()` will just return the original string). If you download
+the latest version of Node from the website then it will include ICU.
+
+## Timestamp Resolution
+
+You may set the `mtime` (the modified time) of a file to `1444291759414`
+(millisecond resolution) and be surprised to see sometimes that `fs.stat`
+returns the new mtime as `1444291759000` (1-second resolution) or
+`1444291758000` (2-second resolution). This is not a bug in Node. Node returns
+the timestamp as the filesystem stores it, and not all filesystems support
+nanosecond, millisecond or 1-second timestamp resolution. Some filesystems even
+have very coarse resolution for the atime timestamp in particular, e.g. 24 hours
+for some FAT filesystems.
+
+## Do Not Corrupt Filenames and Timestamps Through Normalization
+
+Filenames and timestamps are user data. Just as you would never automatically
+rewrite user file data to uppercase the data or normalize `CRLF` to `LF`
+line-endings, so you should never change, interfere or corrupt filenames or
+timestamps through case / Unicode form / timestamp normalization. Normalization
+should only ever be used for comparison, never for altering data.
+
+Normalization is effectively a lossy hash code. You can use it to test for
+certain kinds of equivalence (e.g. do several strings look the same even though
+they have different byte sequences) but you can never use it as a substitute for
+the actual data. Your program should pass on filename and timestamp data as is.
+
+Your program can create new data in NFC (or in any combination of Unicode form
+it prefers) or with a lowercase or uppercase filename, or with a 2-second
+resolution timestamp, but your program should not corrupt existing user data by
+imposing case / Unicode form / timestamp normalization. Rather, adopt a superset
+approach and preserve case, Unicode form and timestamp resolution in your
+program. That way, you will be able to interact safely with filesystems which do
+the same.
+
+## Use Normalization Comparison Functions Appropriately
+
+Make sure that you use case / Unicode form / timestamp comparison functions
+appropriately. Do not use a case-insensitive filename comparison function if you
+are working on a case-sensitive filesystem. Do not use a Unicode form
+insensitive comparison function if you are working on a Unicode form sensitive
+filesystem (e.g. NTFS and most Linux filesystems which preserve both NFC and NFD
+or mixed Unicode forms). Do not compare timestamps at 2-second resolution if you
+are working on a nanosecond timestamp resolution filesystem.
+
+## Be Prepared for Slight Differences in Comparison Functions
+
+Be careful that your comparison functions match those of the filesystem (or
+probe the filesystem if possible to see how it would actually compare).
+Case-insensitivity for example is more complex than a simple `toLowerCase()`
+comparison. In fact, `toUpperCase()` is usually better than `toLowerCase()`
+(since it handles certain foreign language characters differently). But better
+still would be to probe the filesystem since every filesystem has its own case
+comparison table baked in.
+
+As an example, Apple's HFS+ normalizes filenames to NFD form but this NFD form
+is actually an older version of the current NFD form and may sometimes be
+slightly different from the latest Unicode standard's NFD form. Do not expect
+HFS+ NFD to be exactly the same as Unicode NFD all the time.