-
Thanks @adamziel - there's another small point about comparison to existing parsers that's worth noting, unless I missed it in your writing above. Because many existing solutions lack the ability to efficiently parse all of the URLs in a document, they often resort to full-URL string replace. This works pretty well because it's not likely to have
With what you have, a way to efficiently and effectively determine "is this a link and does it point to the old domain, does it need rewriting to the new one?" we can now collapse all of these multiple passes into a single transformation of the document, rewriting only the domains that need changing, and possibly the whole base URL if the new site isn't at the
-
I haven't tried the HTML API yet, but from a glance at the code I want to point out that "parsing" a page the way the HTML API does probably takes a lot of time compared to the "traditional" method of running a search and replace. When running a migration for a site that has hundreds of thousands of posts of various post types, parsing each one will take a long time. For migrating specific block content from one site to another, the HTML API approach will surely work fine, but for full-site migrations I foresee a potential speed problem. There are databases that have millions of rows, and for a full-site migration we need to make sure that all rows go through the replacement. A few special S/R rules that need to be taken into account are
You can have a look at our search & replace class that we created for WP Staging, which contains a few more special cases. It's robust and handles all kinds of special cases that we collected over the years. It's fully unit tested (although our tests are not in our public GitHub repo).
-
Every time we migrate content to or from WordPress, we need to replace the original site URLs with the target site URLs.
Traditional methods like `wp search-replace` just don't cut it. I've recently explored a solution based on the HTML API that, despite being an early prototype, may already be the most comprehensive and correct URL rewriting tool out there.
In this discussion, I'd like to:
Also, a lot of credit for these ideas goes to @dmsnell who spent countless hours building block parsers, HTML parsers, fixing Unicode issues, and just being awesome.
The Problem with Traditional Methods
Traditional methods of URL replacement in WordPress, such as using the `wp search-replace` CLI command, come with several limitations that can lead to various issues. These problems stem from the simplistic nature of these methods, which treat the content as plain text without understanding the context or structure of the document. The primary pitfalls include:
Inconsistent Replacements
Traditional URL replacement methods rely on straightforward string matching and replacement techniques. While this approach can be effective for simple cases, it often leads to inconsistent replacements in more complex scenarios. For example:
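A minimal sketch of that kind of inconsistency, written in Python purely for illustration. The sample strings and the `newsite.com` target are invented here, not taken from a real migration:

```python
# Hypothetical sample strings; OLD and NEW are illustrative domains.
OLD = "https://science.com"
NEW = "https://newsite.com"

samples = {
    # The case plain string matching handles well.
    "plain_link": '<a href="https://science.com/about">About</a>',
    # A different domain that merely starts with the old one.
    "lookalike": '<a href="https://science.com.example.org/">Elsewhere</a>',
    # The same URL inside JSON with escaped slashes.
    "json_escaped": '{"logo":"https:\\/\\/science.com\\/logo.png"}',
}

# Plain substring replacement, with no knowledge of the surrounding syntax.
rewritten = {name: text.replace(OLD, NEW) for name, text in samples.items()}

for name, text in rewritten.items():
    print(name, "->", text)
```

The first sample is rewritten as intended, the lookalike domain is rewritten even though it points to a different site, and the JSON-escaped URL is missed entirely because its slashes are written as `\/`.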
Lack of Context
The traditional methods treat the entire content as raw text and lack an understanding of the document’s structure. This can cause several issues:
`href` or `src`) and URLs that may appear in plain text, comments, or scripts. For instance, altering `<div id="https://science.com">` to `<div id="https://newsite.com">` might affect JavaScript or CSS, leading to unintended behaviors.
`<a href>` attribute or inside block markup.
Here's a few examples:
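To make the idea of context concrete, here is a rough sketch that uses Python's standard `html.parser` as a stand-in for a structure-aware parser. It is not the HTML API and not the prototype's code; the `URL_ATTRS` set and the sample markup are assumptions made for this example:

```python
from html.parser import HTMLParser

# Assumed, deliberately tiny set of attributes that carry URLs.
URL_ATTRS = {"href", "src"}


class UrlAttributeCollector(HTMLParser):
    """Collects values of URL-bearing attributes and ignores the rest."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # The parser has already decoded entities such as &amp; here.
            if name in URL_ATTRS and value:
                self.found.append((tag, name, value))


markup = (
    '<div id="https://science.com">'
    '<a href="https://science.com/page?a=1&amp;b=2">link</a>'
    "</div>"
)

collector = UrlAttributeCollector()
collector.feed(markup)
print(collector.found)
```

Only the `href` value is reported, with `&amp;` already decoded to `&`, while the `id` attribute that happens to look like a URL is left alone. A real rewriter would also have to re-encode the new value for the context it writes back into.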
Punycode, URL Encoding
The URL syntax described in the WHATWG URL standard isn't trivial. There are special rules for encoding Unicode characters, and they're different in paths and query strings. Here's just two:
`%20` for spaces, making direct matching tricky. A naive replacement might fail to recognize or properly handle these encodings, leading to incomplete or erroneous replacements.
The same URL may be expressed in a lot of different ways, for example:
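As a rough illustration of that equivalence, the sketch below reduces two spellings of a URL to a comparable form using only the Python standard library. The `decode_label` and `normalize` helpers are hypothetical names for this example, and the normalization is intentionally simplified compared with the full WHATWG algorithm:

```python
from urllib.parse import urlsplit, unquote


def decode_label(label):
    """Turn a Punycode ("xn--") host label back into Unicode."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def normalize(url):
    """Return a (host, path) pair suitable for comparing two URLs.

    Simplification: percent-decoding the whole path would also decode
    characters such as %2F, which a careful implementation must not do.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    host = ".".join(decode_label(label) for label in labels)
    path = unquote(parts.path) or "/"
    return host, path


a = normalize("https://🚀-science.com/science")
b = normalize("https://xn---science-7f85g.com/%73%63ience")
print(a == b)
```

Both spellings reduce to the same host and path, which is exactly the match a byte-for-byte string comparison cannot make.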
Other Edge Cases
In real-world use cases, URLs can take various forms and structures that challenge traditional search-replace methods:
`https://science.com` might miss `https://blog.science.com` or `https://science.com/path?query=1`. A person doing the migration might want to either preserve or replace the latter two.
`<script>` tag might need to be migrated or might need to be left alone. Ditto for URLs found in HTML attributes such as `class`.
The Solution Using HTML API
The HTML API-based prototype I’ve been developing addresses these traditional pitfalls by leveraging a more sophisticated approach to URL replacement that includes:
`🚀-science.com/science` and `https://xn---science-7f85g.com/%73%63ience` as the same URL.
`UPDATE` queries, we do all the rewriting before the data ever makes it into the database. This enables tracking progress, short-circuiting on error, retrying, and frontloading media files. We can always be sure that every post in the database was correctly migrated and doesn't have to be processed again.
Technical details
Here's a few highlights from the https://github.com/adamziel/site-transfer-protocol/ repository where the prototype lives:
`WP_HTML_Tag_Processor` with the ability to parse and rewrite block attributes.
`next_url()` method capable of semantically finding the next URL in text nodes, HTML attributes, and block markup. It also provides a `set_url()` method that performs a context-aware substitution, escaping, and encoding.
`index.html` in `the index.html file` as a URL, but we do want to consider `wordpress.org` in `the wordpress.org site` as one.
Examples
Here's a sample of what the URL rewriting prototype can already do today. We're migrating `https://🚀-science.com/science` to `https://science.wordpress.com`:
Inline text
Gets rewritten as:
Punycode and HTML entities in text
Gets rewritten as:
Similar-looking domains
Gets rewritten as:
Block attributes
Gets rewritten as:
Non-URL attributes
Gets rewritten as:
Related