why not using python builtin html unescape #9270

Closed
4 of 8 tasks
remitamine opened this issue Apr 21, 2016 · 6 comments

@remitamine
Collaborator

remitamine commented Apr 21, 2016

  • I've verified and I assure that I'm running youtube-dl 2016.04.19
  • At least skimmed through README and most notably FAQ and BUGS sections
  • Searched the bugtracker for similar issues
  • Bug report (encountered problems with youtube-dl)
  • Site support request (request for adding support for a new site)
  • Feature request (request for a new functionality)
  • Question
  • Other

moved from 9e6dd23#commitcomment-17012611.
Why is the HTML unescape function that ships with Python (http://stackoverflow.com/a/2360639) not used? It can handle two cases that unescapeHTML can't (the entity rendered here as ' and numeric references that start with &#X), and it was improved in the latest version to handle HTML5 named character references.
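A minimal sketch of the cases mentioned above, using the public `html.unescape` from Python 3.4+. The sample strings are illustrative, and it is an assumption that the apostrophe case refers to the named entity `&apos;`:

```python
import html  # html.unescape is public API from Python 3.4 onwards

# Illustrative inputs (assumptions, not taken from youtube-dl's test suite).
samples = {
    "&apos;": "'",         # named apostrophe entity (assumed to be the case meant)
    "&#X27;": "'",         # hexadecimal reference written with an uppercase X
    "&#x27;": "'",         # the same reference with a lowercase x
    "&hellip;": "\u2026",  # HTML 4 named entity
    "&period;": ".",       # HTML5 named character reference
}

# Unescape every sample and keep the results for comparison.
results = {src: html.unescape(src) for src in samples}

for src, out in results.items():
    print(repr(src), "->", repr(out))
```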

@yan12125
Collaborator

I guess a reason is mentioned in #329:

_unescapeHTML did the same thing as htmlentity_transform but using the undocumented HTMLParser.unescape

@remitamine
Collaborator Author

remitamine commented Apr 21, 2016

it can handle two cases that the unescapeHTML can't(' and the ones that start with &#X)

If _unescapeHTML did the same thing, it would give the same result.

@yan12125
Collaborator

The problem is that it is undocumented. I guess it's OK, as CPython is the de facto standard. Other Python implementations, such as PyPy and Jython, are moving towards using the same set of standard libraries. Ideas from other developers are needed, though.

@phihag
Contributor

phihag commented Apr 21, 2016

This function is really old. As long as the tests pass (we may need to add some to cover calling it with None and b'123'), feel free to replace its implementation. Note, as @yan12125 said, that we should only use public Python interfaces.

@yan12125
Collaborator

If one day youtube-dl supports Python 3.4+ only, we can use it: https://docs.python.org/3/library/html.html#html.unescape. Closing now.
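Until then, one possible way to prefer the public function and fall back on older interpreters could look like the sketch below. The wrapper name `unescape_html` and its handling of None are hypothetical, not youtube-dl's actual code, and the fallback still relies on the undocumented `HTMLParser.unescape` discussed above:

```python
try:
    from html import unescape  # public API, Python 3.4+
except ImportError:
    # Older interpreters: fall back to the undocumented method.
    try:
        from html.parser import HTMLParser  # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape


def unescape_html(s):
    """Hypothetical wrapper: pass None through, unescape text otherwise."""
    if s is None:
        return None
    return unescape(s)
```

With this shape, callers only ever see `unescape_html`, so the fallback branch can be dropped once older Python versions are no longer supported.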

Something not directly related to this topic but also interesting: http://bugs.python.org/issue27032

yan12125 pushed a commit that referenced this issue Jun 10, 2016
Used in tset_Vporn_1. Also Related to #9270
yan12125 pushed a commit that referenced this issue Jun 10, 2016
Used in test_Vporn_1. Also related to #9270
@yan12125
Collaborator

Since 55b2f09 and 9631a94, HTML5 entities are recognized. It works fine for Vporn. Are there other websites using HTML5 entities?
