Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

build: add i18n build option #6371

Closed
trevnorris opened this issue Oct 17, 2013 · 40 comments
Closed

build: add i18n build option #6371

trevnorris opened this issue Oct 17, 2013 · 40 comments
Milestone

Comments

@trevnorris
Copy link

This is something we'll have to address when we upgrade to v8 3.22 anyways, since it's enabled by default.

IMO it should be disabled by default, but add a configure flag like --with-{intl,icu,i18n} (choose one) that'll automatically take care of adding it to the build. Currently in v8 if you do a make dependencies it checks out a substantial amount of code for icu. I'm told that there's a much slimmed down version that can be used. Either way I don't think it should be bundled with Node by default. Instead it'll just grab the dependency before the build process.

I was able to build current master with i18n support by doing the following steps:

  • Added 'os_posix': 1 to the variables section of common.gypi.
  • Added 'v8_enable_i18n_support': 1 to the variables section of common.gypi.
  • Run the following commands:
cd deps/v8
make dependencies
cd ../../
ln -s deps/v8/build/
mv deps/v8/third_party .
./configure
make -j8

This gave me a build of Node that had the Intl global object.

possible dependency: #6370
extension of: #4689

/cc @tjfontaine @TooTallNate @davglass

@bnoordhuis
Copy link
Member

I don't think you have to set os_posix, deps/v8/build/standalone.gypi is supposed to figure it out by itself, depending on what the OS variable is set to.

@trevnorris
Copy link
Author

@bnoordhuis configure game me that error, or else honestly I wouldn't have even known it was there. :)

@bnoordhuis
Copy link
Member

Hrm, weird. Now that I think about it, os_posix is only used in standalone.gypi but node doesn't use that file. At least, it shouldn't - we don't build v8 binaries.

@davglass
Copy link

Ping @trevnorris, any update on this?

@trevnorris
Copy link
Author

@davglass Master is now using v8 3.22, but we've disabled i18n by default.

@isaacs @bnoordhuis Do we have any plans to include basic icu support? if nothing else as a build flag?

@tjfontaine
Copy link

I see no reason we couldn't make it a configure flag

@davglass
Copy link

That's all we are really looking for is a "supported" configure flag without all the little hacks :)

@bnoordhuis
Copy link
Member

We could make it a configure flag. Caveat emptor, V8 currently only builds against the libicu that's bundled with chromium.

bnoordhuis added a commit to bnoordhuis/node that referenced this issue Nov 21, 2013
Adds a --with-icu-path= switch to the configure script.  Requires that
the user checks out the copy of libicu that's bundled with chromium to
a fixed directory.  It's still a little rough around the edges but it
works.

Fixes nodejs#6371.
@churchofthought
Copy link

Seems like i18n isn't really a necessity given the NumberFormat and Number.toLocalString
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/toLocaleString

@ericf
Copy link

ericf commented Dec 6, 2013

@iandrewfuchs toLocaleString was updated along with the creation of the other Intl.* APIs as part of Ecma 402 Internationalization which is part of ECMAScript 5.1. You can see these comparability details on the MDN page you linked to.

The idea is to bring Node's Internationalization support on par with Firefox, Chrome, and IE 11.

@lwelti
Copy link

lwelti commented Dec 10, 2013

this is marked as closed, this means we can now install a node v8 with the i18n support?

@trevnorris
Copy link
Author

@lwelti you'll have to build Node from the master branch using the options specified in the README.

@JannesMeyer
Copy link

You can't be serious in not including Unicode support? What kind of an USA-centric view is this you're propagating with your software?

I'm trying to build a Node.js web application with some nicely sorted sorted lists.

var list = [ 'Z', 'A', 'Á', 'Ä' ];
list.sort(function(a, b) {
  return a.localeCompare(b);
});
list;

Node.js (v0.11.12) output:

[ 'A', 'Z', 'Á', 'Ä' ]

Chrome (34.0.1847.116) output:

["A", "Á", "Ä", "Z"]

It took JavaScript long enough to evolve to a state where it has something like String.prototype.localeCompare() or the Intl.* APIs. Could you please elaborate what reasons you have for blocking access to those features that are existent in V8?

@trevnorris
Copy link
Author

@JannesMeyer You're blatant attitude is unwelcome. Next time just ask for a reason, rather than start off with accusations.

Node bundles all of its dependencies. This adds extra code weight, but it makes development and debugging much simpler. Especially in the few cases where we have to float a patch (f9ced08 for example).

Right now the only icu package v8 supports is https://src.chromium.org/chrome/trunk/deps/third_party/icu46. Which adds an additional 350MB of dependency files. Judging by the lack of support tickets or mailing list questions on the matter generally demonstrates this is a feature a small minority use, and this minority has thus far found it acceptable to build their own in order to enable this feature.

IIRC we discussed that once an icu package can be included that isn't so dammed large it would be acceptable to include it as a dependency.

Note: accidentally posted the wrong has in the floating patch case. this has been changed.

@JannesMeyer
Copy link

@trevnorris I am sorry for jumping to conclusions a bit too quickly. I guess that was unfair.

But on the other hand I think it's also unfair to assume that nobody wants these features just because nobody is finding the right place to ask questions about it (an English-speaking mailing list, English-speaking github issues, et cetera...) I think these problems disproportionally affect people whose primary language might be something else than English.

Furthermore, it took me about an hour to track this broken behavior down to a Node.js issue. At first I thought that there was a bug in my code, because in some cases the sorting of some Unicode characters actually seemed to work correctly. Other developers might not be that patient and just move on and leave their users with a less than optimal sorting behaviour or they might implement nasty workarounds.

I think that Unicode support indeed is a neccessity to provide a good user experience. Some people who are asking for it:

An effort to create a polyfill for Node.js: Intl.js: Node.js support and npm package
Which was eventually abandoned:

The Default Unicode Collation Element Table is huge, even after compression, and converting to a native JavaScript object would probably make it slightly larger. Server-side JavaScript environments will (hopefully) soon support Intl.Collator, and we can't really expect client environments to download this data.
Intl.js' README.md

And this is just for sorting. I'm not even talking about all the other Unicode issues which affect JavaScript in general: JavaScript has a Unicode problem

I understand your reasons for not wanting to pull in 350MB of source files directly into the Node repo. But why not make it a git submodule or an external dependency?

@caridy
Copy link

caridy commented Apr 24, 2014

For the sake of transparency, let's NOT use the 350MB argument, because that's not accurate. @NorbertLindenberg and @srl295 explained in details how this could simply be ~4MB. More details here: #4689 (comment)

@srl295
Copy link
Member

srl295 commented Apr 25, 2014

@trevnorris Hi, ICU developer here. Do you have a target size? I'm working on slicing things up, some idea of what's acceptable would be helpful.

(By the way - 350MB may be Chrome/Android/v8-i18n's size, but the actual ICU 4.6 source base - including source to the data, not even including prebuilt data- is only 88MB. With prebuilt data it is 70MB on disk. The way v8-i18n pulls it from the Chrome source base, it includes multiple copies of the prebuilt data. You can see for yourself at http://site.icu-project.org/download/46 or better yet http://site.icu-project.org/download/53 . One of my tasks is to allow v8-i18n to build from a normal ICU package. )

@srl295
Copy link
Member

srl295 commented Apr 25, 2014

@trevnorris so you can use the customizer tool at http://apps.icu-project.org/datacustom/ICUData46.html and deselect data items, and then download an ICU4C data file. Unzip the resulting file, drop it in as node/deps/v8/third_party/icu46/source/data/in/icudt46l.dat. .. and then, I am still figuring out the chrome/android build system, so the next steps aren't as simple.

@srl295
Copy link
Member

srl295 commented Apr 25, 2014

@trevnorris okay. four megabytes, that's the increase with just English, just by using the data customizer and building ICU tools on the side to rebuild the data file. (i.e. no specific code changes).

So the score:

12M node  ( x86_64, built from git in february sometime )
24M node  ( same, but built with i18n support as per README.md )
16M node  ( same, but with English-only support )

You can get the .S file at https://ssl.icu-project.org/~srl/tmp/icudt46l_dat.S_ENGLISH.tar.xz and then it replaces node/deps/v8/third_party/icu46/linux/icudt46l_dat.S. (note: the resultant node doesn't work, but probably I did something wrong building it. As I said below, Chrome's build system is unfamiliar to me as of yet, unlike ICU's own.)

I'm sure there is code and data that node doesn't need being included there, but I'd like to get the conversation going here.

@trevnorris
Copy link
Author

@JannesMeyer Fair enough about non-English speakers not being able to post their concerns. As far as allowing it to be an external dependency, instead of being included like all other dependencies, would have to be discussed with the remainder core team.

@caridy I'll concede that by removing third_party/icu/.svn directory after running make dependencies that download size decreases to 172MB. But if you'd read #4689 (comment) you'd notice it addresses the extra download size of the binary. Not the additional amount of code.

@srl295 As far as "target size" are you referring to additional size of the binary, or the additional amount of data that would be included in the repository?

AFAIK, no one on the core team opposes having ICU support. The issue is the fact we include all external dependencies, and including ICU for all supported platforms would include a massive amount of code.

@tjfontaine @indutny I'd like you're guys' feedback on this matter.

@srl295
Copy link
Member

srl295 commented Apr 25, 2014

@trevnorris as noted above - ICU4C 4.6 is a 15MB .tgz, unpacking to 70M on disk, or 88M from svn. I don't understand Chromium's build system or why there are six (6) copies of the data in their source tree in various odd formats, such as three versions of assembly language constants- all things I plan to address.

Is it acceptable to have the base have a 4MB increase ( as a first rough cut - can probably be improved ) and English-only support (which is 984k of that 4MB)? And some sort of mechanism to add other languages, or an alternate build containing 'everything' (which will add not more than 8MB additional).

@bnoordhuis
Copy link
Member

@srl295 As a data point, a release tarball is currently about 15 MB. Once extracted, it's around 85 MB. My initial objection against bundling libicu was that it would double both numbers.

If you can keep the increase down to, let's say 20 MB / 100 MB (not hard numbers, just guidelines), that would be perfectly acceptable to me. It'd be great to have i18n enabled by default.

@trevnorris Sounds reasonable?

@indutny
Copy link
Member

indutny commented Apr 25, 2014

I agree with @bnoordhuis.

But I think it should be possible to build node with any ICU configuration using ./configure flags. If users want 400 mb binary - they should get it. I wonder if it could be possible to add ICU support to v8 via addon?

@srl295
Copy link
Member

srl295 commented Apr 25, 2014

@bnoordhuis @indutny The max size of the binary increase is 12MB not 400MB. So ICU (thanks for using that name instead of "libicu") and node both have pretty comparable binary (12M) and source (15MB expanding to 85MB) sizes - please think about them as comparable. There's no 400Lb or 400MB gorilla.

Good question. The problem I see is balancing custom modifications to ICU vs. being able to just drop in a new version. If doubling the source size is acceptable, we can keep the complexity lower and thus ease of updating. What we can't do is have english-only data in the source and then enable additional data without hitting the network.

@thom4parisot
Copy link

What we can't do is have english-only data in the source and then enable additional data without hitting the network.

Could the ICU data be loaded through a npm module?
So as people could:

  • benefit from the Intl features natively;
  • disable the i18n features for space/performance issues (I'm thinking on cheap embedded devices);
  • upgrade ICU data without requiring the Node team to release updates just for that.

To update all ICU data:

npm update -g icu-data-all

To build without ICU data, think of --without-i18n as --without-npm acts with npm) and/or update ICU data to collect translation updates without having to update Node.

./configure --without-i18n && make && make install
npm install -g icu-data-fr

We could assume all icu-data-<lang> rely on an icu-data-core/icu-data-en peerDependencies.

I don't know the feasibility/difficulty of doing such a thing but well, it's my 2 cents on the problem, from a not-only-English-speaking-app developer ;-)

@srl295
Copy link
Member

srl295 commented Apr 25, 2014

@oncletom ICU's source data has 3,500 elements to it in 500+ locales (here I go w/ big numbers) - icu-data- may be too fine grained. Perhaps one module that just adds "everything" for <10MB?

ICU does have api such as [u_setDataDirectory][http://icu-project.org/apiref/icu4c/putil_8h.html#ab03241e29fc619a4b3f7737e98aa08ef] and [udata_setCommonData][http://icu-project.org/apiref/icu4c/udata_8h.html#a467bda719595adb58f959dde735e1153] which mean "go here for a nice .dat file to memory-map" and "here's a pointer to a bunch of readonly data to use" respectively. If these are called before any other ICU functionality is used, they could provide access to on-disk or any other access to one chunk of additional data. ( would need to be in v8 and then exposed to node of course ). I don't know enough about the start-up or dependency situation to know what the other layers would do, but it's possible from ICU's side.

@srl295
Copy link
Member

srl295 commented Apr 29, 2014

Update: I'm able to build node from an arbitrary (non-chrome) ICU.

@trevnorris
Copy link
Author

@srl295 that's good to hear. Now, how often will we have to update the ICU library? Just curious.

Also, it seems there is the issue of language support. How many languages do we support? etc. As @oncletom noted, it could be advantageous to allow users to install additional language support via npm.

It's late and I'm tired, so here go the stupid questions. Would it be possible to only include the code that allows ICU to work, but not include any language support (since it seems that is where the bulk of the size comes from)? And would it then be possible to install whatever language support someone desired via npm? Issue is that the person would still have to do some type of require('icu-support'); at the top.

Just crapping out ideas.

While it's convenient all dependencies are included, I'm annoyed that the deps/ folder is 85% of the size of the source tree. Anyways, rant over.

@srl295
Copy link
Member

srl295 commented Apr 29, 2014

@trevnorris if I've even partially done this right, the branch I'm working on is at https://github.com/srl295/node/compare/smaller-icu - can build from any ICU, but calls out to shell/makefile (just to get things moving). This is WIP/experimental/proof of concept, not a looming pull request.

Numbers: node tip is now 14MB for me as I write. I was able to get a node build with the latest ICU (from ICU trunk), + just english, at 18MB (so +4 still). With "everything" is 38MB, but, I have not tried "all locales, but only what Node actually will use". I would guess it would be on the order of +10MB from non-i18n.

how many languages? ICU out of the box supports ~200 languages, ~500 locales total. [CLDR][http://cldr.unicode.org] has some others that ICU doesn't take by default.

how often? ICU (and CLDR) update roughly yearly. Minor updates at other times of course. I'd say node could determine what schedule works, perhaps watching CLDR especially for any language change of interest.

*.. just the code that allows ICU to work.. * my "18MB count" above ( 4 MB over stock node ) is what I can build at 11PM that's node + all code needed + english/root data.

...install language support .. I don't know enough about the loading process in node or v8. As is, ICU has to be called first thing with a "here's some new data" function, not mid-operation. But, that's as-is, it's all just code. I would question how fine to slice the installed packages. Adding French, say, (in all territories) adds 67k. Of that, French in Canada is 4.5k. Hundreds of tiny packages seem like they could create more overhead than flexibility.

Maybe I can return with stupid questions of my own: is there any concept of persistent state, a "node.js home/configuration directory", etc where data could be loaded from? I think the answers are no, if not a resounding no.

Anyways, thanks for the thoughts here. Tomorrow I'll figure out what the upper limit is using "only stuff ecma402 cares about". Then we can really compare: 14MB with nothing, 18MB with English, vs ?? with "all locales".

@bnoordhuis
Copy link
Member

Maybe I can return with stupid questions of my own: is there any concept of persistent state, a "node.js home/configuration directory", etc where data could be loaded from? I think the answers are no, if not a resounding no.

Indeed. It's either baked into the binary or loaded through the module system - but that's long after the VM is up and running.

Idle speculation: with icu_use_data_file_flag=1, V8::InitializeICU() calls udata_setCommonData(). I know you can call that function multiple times but does that also work when ICU is already being used?

@srl295
Copy link
Member

srl295 commented Apr 30, 2014

long after the VM is up and running.

Right, the timing could be a problem here.

does that also work when ICU is already being used?

No, unless ICU is u_cleanup'ed and reloaded. I don't know if that's possible from within v8, if all ICU service objects in all threads can be accounted for and closed.

@trevnorris
Copy link
Author

I just want to make sure I understand what's going on.

How much source code would be added for just English support? (if it came bundled like other deps)

Are we willing to include an external dependency? (@tjfontaine)

How much would it add to the binary size?

Is it possible to include other languages after the program has started up?

@tjfontaine
Copy link

This will be the path forward:

  • Node will not be distributing binaries with i18n enabled by default for the short term
    • because of this for now we do not need to include the source in the repo (if ever)
    • we can ship a script that makes it easy to download source and language files
    • we can update the configure script to make it easy to define the languages you want to support
  • In the future Node will have two options
    • support language information built into the Node binary
    • support language information external to the Node binary
      • configurable at build time for a default location
      • configurable at runtime to redirect to a different location

This is basically the same work that needs to be done for certificate stores for crypto.

@srl295
Copy link
Member

srl295 commented May 12, 2014

@tjfontaine recappipng here from our IRC conversation. Please correct me if I misunderstood anything.

  • timeframe: post 0.12:
  • we can ship with i18n support built in but little/no data
  • at build time, a default disk location can be set for whatever data should be loaded
  • at runtime, --icu-data-dir on the command line or NODE_ICU_DATA as an env var can override that default location and load additional
  • icu can be enabled by default in builds, but not necessarily included in github

So, ICU supports all of this today, just need to make sure the right calls get made.

@trevnorris

  • How much would this add to the binary size? my target is <=4Mb for English only code+data
  • Is it possible to include other languages after the program has started up? Not without basically restarting ICU: making sure all instances of service objects are closed in all threads.

@ericf
Copy link

ericf commented May 12, 2014

I feel strongly that Intl should be supported by default and the implementation should be included in the default binaries. Otherwise Node.js will not be a ECMAScript 5.1 runtime. I would hate for this to become a trend where Node.js diverges from ECMAScript and not contain language features implemented in the all the other major ECMAScript runtimes (browsers).

This is clearly a complex language feature to implement, but it should be noted that Firefox and Chrome were able to ship Intl with data for the common locales in ~4MB of overhead.

It would be great to also get clarity from @tjfontaine on whether or not it's a core project principle/requirement for Node.js to be an ECMAScript runtime and plan to follow/implement updates to the spec. (I get that most of the language features will come in via v8, but things like Intl are the exception where this kind of principle can help guide the decisions.)

@NorbertLindenberg
Copy link

The statement "otherwise Node.js will not be a ECMAScript 5.1 runtime" is not correct. ECMAScript 5.1 is defined by standard ECMA-262 edition 5.1, while the ECMAScript Internationalization API is defined by a separate complementary standard, ECMA-402. A runtime can implement ECMA-262 edition 5.1 and thus ECMAScript 5.1 without implementing ECMA-402.

That said, I'm with Eric on the practical aspect: It's not clear why adding the internationalization API and ICU (with a combined download size of a few MB) is such a big issue for Node. Download size is a real issue for browsers because they're downloaded by end-users, many of whom (especially in developing countries) don't have broadband access yet. It shouldn't be an issue for server software, which is usually only downloaded by people (developers, ops people) who have broadband access. It seems at least Ebay, Yahoo, and IBM need the internationalization API in Node, and I can't imagine they're the only ones.

@ericf
Copy link

ericf commented May 14, 2014

@NorbertLindenberg thanks for clarifying the relationship between the ECMA-262 and ECMA-402 specs, and how they relate to ECMAScript 5.1.

@tjfontaine
Copy link

I don't find it controversial at all -- I will happily review PRs that work in the way I described

@poperor
Copy link

poperor commented Dec 19, 2014

Hi, just a late remark on all this: I strongly support that i18n support (including Collations and proper sorting) should be a standard part of node or at least should be easyly configurable, without doing a custom build. It is obvious that this is a central feature for lots and lots of non english languages (including my own norwegian language). Chrome, Firefox and IE are already supporting this on the client side.

@srl295
Copy link
Member

srl295 commented Dec 19, 2014

@poperor thanks for the support! Please see #7676 for the latest news.

richardlau pushed a commit to ibmruntimes/node that referenced this issue Jun 17, 2016
Addresses nodejs#5566. The `ee.once()` function is currently documented as
invoking the listener, and then removing it when the event is
triggered. However, this is not really the case. The listener is removed
and _then_ invoked. This only matters in a narrow set of use cases, but
when it matters, it matters that the docs are correct.

See the issue (nodejs#5566) for a discussion on why the code has not been
modified to match the documentation, but instead the documentation has
been modified to match the code.

Fixes: nodejs#5566
Ref: nodejs#6371
PR-URL: nodejs/node#7103
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Robert Lindstaedt <robert.lindstaedt@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Сковорода Никита Андреевич <chalkerx@gmail.com>
richardlau pushed a commit to ibmruntimes/node that referenced this issue Jun 29, 2016
Addresses nodejs#5566. The `ee.once()` function is currently documented as
invoking the listener, and then removing it when the event is
triggered. However, this is not really the case. The listener is removed
and _then_ invoked. This only matters in a narrow set of use cases, but
when it matters, it matters that the docs are correct.

See the issue (nodejs#5566) for a discussion on why the code has not been
modified to match the documentation, but instead the documentation has
been modified to match the code.

Fixes: nodejs#5566
Ref: nodejs#6371
PR-URL: nodejs/node#7103
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Robert Lindstaedt <robert.lindstaedt@gmail.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Сковорода Никита Андреевич <chalkerx@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests