Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Share more code with xml5ever #266

Closed
SimonSapin opened this issue Apr 13, 2017 · 10 comments · Fixed by #268
Closed

Share more code with xml5ever #266

SimonSapin opened this issue Apr 13, 2017 · 10 comments · Fixed by #268

Comments

@SimonSapin
Copy link
Member

Yes, after discussion in https://github.com/Ygg01/xml5ever/issues/30 we’ve moved the xml5ever code into this repository with the intention to share more code. However much of that has yet to happen. Rough plan:

  • Make every crate in the repo use a single workspace

  • Make sure Travis-CI is running every test

  • Rename the the html5ever_atoms crate to markup5ever and update html5ever and xml5ever to use it. This is a breaking change, so increment version numbers accordingly.

  • Make it so that users of either html5ever or xml5ever don’t need to have an explicit dependency to markup5ever (which should be an implementation detail). One way to do this is to have pub use markup5ever::*; at the root of each crate.

  • Find things that exist in both html5ever and xml5ever, and move them to markup5ever. Judge on a case-by-case basis when there are differences.

    For example TokenSink has some methods in common, but both parsers have some methods unique to it in their respective TokenSink. Since this trait is implemented by users of the libraries, the shared TokenSink should have the union of all methods (to support both parsers) and provide default implementations of methods where that makes sense (to reduce the overhead for users that only care about one parser).

    The Parser and BytesParser types in html5ever::driver are busted for handling </script>, we might want to remove them or redesign them. (Don’t bother adding it to xhtml5ever.)

@Ygg01
Copy link
Contributor

Ygg01 commented Apr 13, 2017

Sounds like a plan. I am assuming following structure for workspace:

├── Cargo.toml
├── markupl5ever
|   ├── Cargo.toml
|   └── src
|  └──...
├── html5ever
|   ├── Cargo.toml
|   └── src
|    └──...
└── xml5ever
    ├── Cargo.toml
    └── src
    └──...

@SimonSapin
Copy link
Member Author

Looks good 👍

I sometimes don’t use a src directory (with [lib] path = "lib.rs" in Cargo.toml), but it’s a matter of taste and how little stuff there would be besides Cargo.toml and src.

@Ygg01
Copy link
Contributor

Ygg01 commented Apr 14, 2017

Here is the WIP: https://github.com/Ygg01/html5ever/tree/xhtml5ever

Questions I have are following:

  1. What does travis-build.sh does? It's called in .travis.yml, but seems to essentially, just call cargo test? Or is the capture part doing something interesting?

  2. What QualName to use? Apparently, there were discussions around QualName having a local, prefix and namespace, but there were fears about it exploding size? Could I abstract over it like so:

struct QualName<T>{
    pub localname: Localname,
    pub namespace: Namespace,
    pub prefix: T,
}

type XmlQualName = QualName<Prefix>;
type HtmlQualName = QualName<()>;
  1. This one is more of a question for me, but I think there isn't a huge value in EmptyTag (e.g. <tagName/>) and Short tag (e.g. </>). Without those two, I could reuse TagKind for both[1] [2]. However, I might not change this now, because changes will be numerous and hard even without it; and this change requires changing of XML parsing. Are these changes ok, in your opinion?

  2. Is there a way to preserve the git log while moving renaming files? The files I didn't author list me as sole contributor...

@SimonSapin
Copy link
Member Author

  1. travis-build.sh could be inlined in .travis.yml. Having a separate file avoid doubt of what escaping is required to nest bash syntax inside YAML string syntax.

    Rust’s test harness has a feature where the output of a test is "captured" and shown only that test fails, together with the failure message. Since tests can run in parallel on multiple threads, this avoids interleaved output and shows which output comes from which test. This feature is based on #[unstable] APIs from the standard library.

    html5ever uses a fork of the test crate that exposes APIs that are #[unstable] in the standard library copy, in order to generate the list of tests dynamically. To make this fork usable on stable Rust, any feature that uses #[unstable] std APIs are optional and disabled by default. --features "rustc-test/capture" enables the capture feature.

  2. I think we should not worry about the size of QualName. Servo only uses it transiently during parsing. What is stored in Element for example looks like this:

    pub struct Element {
        // …
        local_name: LocalName,
        namespace: Namespace,
        prefix: Option<DOMString>,
        // …
    }

    Let’s have a single QualName type that contains Option<Prefix>. This will often be None in html5ever, except in TreeBuilderActions::adjust_foreign_attributes. Adjust foreign attributes in the spec can set a non-null prefix.

  3. Empty tags are also part of XML 1. Even if they weren’t, this alone doesn’t feel like a strong enough reason to drop features from XML 5. And this is the interface between tokenizers and their respective tree builders, which are not interchangeable and typically implementation details not used directly by users of these libraries. Keeping two TagKind enum sounds fine.

  4. Git does not explicitly store information about renames. Instead it can reconstruct that information when querying history. For git log, this is enabled by the --follow (for a single file) or -M a.k.a. --find-renames (for diffs) options. This is based on searching for new files that look similar to files removed in the same commit. If the files are not identical, there’s a heuristic to decide whether they’re close enough. So if you’re both renaming and making significant changes to a file, consider doing so in separate commits to help this heuristic.

@karan1276
Copy link

Hello, I am a student. I recently started contributing on servo. I found my self doing only "easy" bugs, which gets boring after some time. I would like to work on servo and actually understand whats going on. To begin with i have decided to work on html5ever first and understand how parsing is done at lower levels, If you guys could sort of mentor me that will be great( @SimonSapin @Ygg01 )

Looks like this issue has lots of stuff to do, may i be of any help?

@SimonSapin
Copy link
Member Author

I think @Ygg01 has already started working on this, so you should coordinate together :)

@Ygg01
Copy link
Contributor

Ygg01 commented Apr 20, 2017

@karan1276 I will gladly accept a PR to my PR :P So what do you want to do? Want do you want to know about? You can look at xml5ever as a simplified version of html5ever. Basically, xml5ever and hmtl5ever tokenization is similar, but tree building is waaaay different, because HTML has insertion modes.

@SimonSapin Could I partially commit this? I think I did items 1-4 on that list, but recently some PRs broke my commits :( Last bullet point is really numerous, and might take more sense to divide.

@SimonSapin
Copy link
Member Author

@Ygg01 one thing is that if there are breaking API changes I’d prefer to batch them as much as reasonable. Do you think there’s work that can be deferred to later without introducing more breakage then? In any case feel free to send PRs :)

@Ygg01
Copy link
Contributor

Ygg01 commented Apr 25, 2017

@SimonSapin Ok, I think I got most of stuff done. Only few issues remain:

  1. xml5ever TokenSink/TreeSink bitrotted a bit. E.g. it still has old fn process_token. My thoughts are keep those methods but mark them for deprecated, and/or remove them in next version (PR will be too hard to scrutinize as it is, it's ~300 additions and ~3000 deletions)

  2. I want to re-export QualNames while I am at it, to close QualName should be re-exported #210. What does this entail? Just pub use markup5ever::* at root of xml5ever and html5ever or like this

  3. markup5ever, now contains a lot of different and unconnected stuff. I moved named_entities, buffer_queue, small_character_set, TokenSink, TreeSink. I've organized them like this:

├── markupl5ever
   ├── Cargo.toml
   └── src
       ├── data
       |     ├── entities.json
       |     └── mod.rs (concatenates source of named_entries.rs)
       ├── util
       |     ├── small_char_set.rs
       |     └── buffer_queue.rs
       ├── interface 
       |     ├── mod.rs
       |     ├── tree_builder.rs
       |     └── tokenizer.rs
       └── lib.rs

Is there in your opinion better way to organize these, things?

@SimonSapin
Copy link
Member Author

Don’t spend effort trying to unify TokenSink. The XML5 tokenizer only needs to work with the XML5 tree builder, and the HTML tokenizer only needs to work with the HTML tree builder. They are not interchangeable.

Regarding deprecation, we’re making breaking changes anyway so I wouldn’t worry too much about it.

pub use markup5ever::*; sounds good. No risk the we forget to re-export something.

@Ygg01 Ygg01 mentioned this issue Apr 29, 2017
12 tasks
bors-servo pushed a commit that referenced this issue May 2, 2017
Xhtml5ever

Ok, this is a large one.

Fixes #266, fixes #261, fixes #210.

It moves html5ever into separate folder, renames html5ever macros markup5ever and stores common code there.

Here is short summary of what I know is and isn't done.

- [x] Make every crate in the repo use a single workspace
- [x] Make sure Travis-CI is running every test
- [x] Rename the the html5ever_atoms crate to markup5ever and update html5ever and xml5ever to use it.
- [x] Increment version numbers
- [x] Make it so that users of either html5ever or xml5ever don’t need to have an explicit dependency to markup5ever
- [x] Export QualName #210
- [x] let markup5ever generate entities.json #261
- [ ] **Move TokenSink to markup5ever**
- [x] Move TreeSink to markup5ever
- [x] Move BufferQueue to markup5ever
- [x] Move SmallCharSet to markup5ever
- [ ] **Deal with driver.rs**

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/html5ever/268)
<!-- Reviewable:end -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants