Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor properties to separate crate #1153

Merged
merged 16 commits into from
Oct 20, 2021

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Oct 6, 2021

Renaming the UnicodePropertyV1 data struct, which represents UnicodeSet data for binary properties and for enumerated properties (per key=val pair). It would allow the corresponding data provider struct for CodePointTrie data, which would only exist for enumerated properties, to have a consistent name, once created.

@iainireland
Copy link
Contributor

The "Data" in "UnicodeSetData" doesn't seem necessary. (Compare, say icu_decimal::provider, where we have AffixesV1, GroupingSizesV1, and so on.) Alternative suggestions: UnicodeSetV1, UnicodePropertyV1, UnicodePropertySetV1?

What do you intend to call the CodePointTrie equivalent?

@echeran echeran linked an issue Oct 7, 2021 that may be closed by this pull request
@echeran
Copy link
Contributor Author

echeran commented Oct 8, 2021

The "Data" in "UnicodeSetData" doesn't seem necessary. (Compare, say icu_decimal::provider, where we have AffixesV1, GroupingSizesV1, and so on.) Alternative suggestions: UnicodeSetV1, UnicodePropertyV1, UnicodePropertySetV1?

What do you intend to call the CodePointTrie equivalent?

I looked at other data provider structs, and it seems like they all have a plural noun name, not a singular word name -- just an observation.

As far as how to re-/name the data struct for UnicodeSet data in a way that will be consistent with the data struct for CPT data, I'm open to suggestions for improve my naming ideas. We currently have UnicodePropertyV1 and UnicodePropertyMapV1, but ythen ou could imagine changing UnicodePropertyV1 -> UnicodePropertySetV1 for consistency. But then at that point, why not drop the "Property" -- b/c I think the distinguishing factor between these data structs is the shape of the data (the data structure) that they contain -- and then you get UnicodeSetV1 and UnicodeMapV1.

I thought UnicodeSetV1 sounded too much like UnicodeSet itself (which is why I originally said UnicodeSetDataV1), but that's enough naming bikeshedding for me, and I usually defer to others for short better names.

@echeran echeran mentioned this pull request Oct 11, 2021
3 tasks
@echeran echeran force-pushed the props-data-refactor branch from e5c4dec to 5b84b60 Compare October 13, 2021 01:09
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • Cargo.lock is now changed in the branch
  • Cargo.toml is now changed in the branch
  • components/uniset/Cargo.toml is now changed in the branch
  • components/uniset/src/lib.rs is now changed in the branch
  • components/uniset/src/props.rs is different
  • components/uniset/src/provider.rs is no longer changed in the branch
  • docs/tutorials/writing_a_new_data_struct.md is no longer changed in the branch
  • provider/uprops/src/binary.rs is no longer changed in the branch
  • provider/uprops/src/enumerated.rs is no longer changed in the branch
  • provider/uprops/src/provider.rs is no longer changed in the branch
  • utils/properties/Cargo.toml is now changed in the branch
  • utils/properties/README.md is now changed in the branch
  • utils/properties/src/enum_props.rs is now changed in the branch
  • utils/properties/src/lib.rs is now changed in the branch
  • utils/properties/src/ule.rs is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@echeran echeran changed the title Properties data struct renaming Refactor properties to separate crate Oct 13, 2021
@echeran
Copy link
Contributor Author

echeran commented Oct 13, 2021

The conversation above is relevant to PR #1167, and I incorporated those ideas into that PR since it seemed necessary to avoid confusing code and naming.

However, in order for that PR to proceed, it needs to solve a confusing and misleading circular dependency that is currently caused by having property enums / newtype structs defined in icu_uniset, but the TrieValue trait in icu_codepointtrie. We intend to implement TrieValue for the property enums/newtypes.

At the least, we should refactor the property enums/newtypes to a separate crate. This PR is being renamed and re-purposed to represent that new scope of work which will unblock #1167.

(And if Rust only allows impls of traits on enums/newtypes in the same module/crate in which they are defined, then perhaps the TrieValue trait should be moved to this new crate, too?)

@iainireland
Copy link
Contributor

Regarding names: I don't have strong feelings, but it occurs to me that if we were doing this from scratch, without basing anything on ICU4C, I think it's likely that we'd omit "Unicode" and call these something more like "CharacterSet" and "CharacterMap". Most of our data is from Unicode; what's interesting about this data is that it represents character properties.

@echeran echeran linked an issue Oct 18, 2021 that may be closed by this pull request
3 tasks
@echeran echeran requested a review from sffc October 18, 2021 21:47
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure all the points from the renaming issue are covered

pub struct GeneralCategory(pub(crate) u32);
pub struct GeneralCategory(pub u32);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change, and if you really need it, add a impl TryFrom<u32> for GeneralCategory if there isn't already such an impl. @iainireland thoughts? Don't we already need this for the Serde?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pub struct Script(pub(crate) u16);
pub struct Script(pub u16);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 5 to 9
//! `icu_properties` is a utility crate of the [`ICU4X`] project.
//!
//! This component provides definitions of [Unicode Properties].
//!
//! [Unicode Properties]: https://unicode-org.github.io/icu/userguide/strings/properties.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please strengthen these docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

mod enum_props;
mod ule;

pub use enum_props::*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion: Do we want to pub use enum_props::* and put everything in the top-level namespace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two reasonable choices:

  1. icu_properties::get_alnum() and icu_properties::get_general_category()
  2. icu_properties::sets::get_alnum() and icu_properties::maps::get_general_category()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly favor hoisting to the top level (1/10), but I can see advantages to the smaller namespaces.

@zbraniecki says:

my intuition is that it's easier to consoidate namespaces than separate them, so I'd start with more specific (::sets::) and reconsider once the API surface stabilizes

So I think we should go ahead with the smaller namespaces for now, and revisit this later when we do a comprehensive API review prior to 1.0. (We don't need a separate issue to track that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. We only have getter APIs for UnicodeSets, which are in the new icu_properties::sets module. The enums/newtypes for the actual properties themselves are up in the top-level namespace.

@echeran echeran marked this pull request as ready for review October 19, 2021 01:17
@echeran echeran requested review from iainireland and a team as code owners October 19, 2021 01:17
@echeran echeran requested a review from sffc October 19, 2021 01:17
pub use icu_uniset::*;
pub use icu_properties::*;
Copy link
Member

@sffc sffc Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename icu::uniset to icu::properties.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line was part of the section defining icu::uniset, which was removed.

/// use icu::uniset::UnicodeSetBuilder;
/// use icu_uniset::UnicodeSetBuilder;
Copy link
Member

@sffc sffc Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should still be icu::uniset, pending the discussion below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we decided to remove icu::uniset, I am keeping it was icu_uniset because we cannot revert back to icu::uniset.

@echeran echeran requested a review from sffc October 19, 2021 17:56
Comment on lines +24 to +25
/// GeneralSubcategory only supports specific subcategories (eg `UppercaseLetter`).
/// It does not support grouped categories (eg `Letter`). For grouped categories, use [`GeneralCategory`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: good catch

Comment on lines +379 to +382
//! [`ICU4X`]: ../icu/index.html
//! [Unicode Properties]: https://unicode-org.github.io/icu/userguide/strings/properties.html
//! [`UnicodeSet`]: ../../icu_uniset/struct.UnicodeSet.html
//! [`sets`]: sets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below: Let's fix these crossrefs.

#![no_std]

mod props;
#[allow(unused)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you have the #[allow(unused)] ?

@@ -658,7 +659,7 @@ where
GeneralCategory::SpaceSeparator => key::GENERAL_CATEGORY_SPACE_SEPARATOR_V1,
_ => return Err(UnicodeSetError::UnknownGeneralCategorySet(enum_val.0)),
};
get_prop(provider, key)
get_uniset(provider, key)
}

/// Return a [`UnicodeSet`] for a particular value of the Script Unicode enumerated property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: change this function to get_for_script so that you call it as icu::properties::sets::get_for_script()

@echeran echeran requested a review from sffc October 20, 2021 00:35
sffc
sffc previously approved these changes Oct 20, 2021
@echeran echeran merged commit 30f0687 into unicode-org:main Oct 20, 2021
@echeran echeran deleted the props-data-refactor branch October 20, 2021 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor property definitions to a separate crate Crate layout of Unicode Properties
3 participants