feat(sites-29416)!: Add support for importing non-image assets #7

Merged
merged 18 commits into from
Feb 26, 2025
Changes from 17 commits
4 changes: 2 additions & 2 deletions package-lock.json


4 changes: 2 additions & 2 deletions src/index.js
@@ -10,9 +10,9 @@
* governing permissions and limitations under the License.
*/
import { createJcrPackage } from './package/packaging.js';
import { getImageUrlsFromMarkdown } from './package/image-mapping.js';
import { getAssetUrlsFromMarkdown } from './package/asset-mapping.js';
Collaborator (review comment): This rename requires a full version bump when releasing!


export {
createJcrPackage,
getImageUrlsFromMarkdown,
getAssetUrlsFromMarkdown,
};
38 changes: 25 additions & 13 deletions src/package/image-mapping.js → src/package/asset-mapping.js
@@ -18,6 +18,9 @@ const imageRegex = /!\[([^\]]*)]\(([^) "]+)(?: *"([^"]*)")?\)|!\[([^\]]*)]\[([^\
// Regex for reference definitions
const referenceRegex = /\[([^\]]+)]:\s*(\S+)/g;

// Regex for non-image asset links (PDFs, docs, excel etc.)
const nonImageAssetRegex = /(?:\[(.*?)\]|\[.*?\])\(([^)]+\.(?:pdf|doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp|rtf|txt|csv))\)|\[(.*?)\]:\s*(\S+\.(?:pdf|doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp|rtf|txt|csv))/gi;
Collaborator (review comment): We need a unit test to validate that the regex catches each of the supported asset types.
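
A minimal sketch of the kind of check this comment asks for. It copies `nonImageAssetRegex` from the diff above and runs it over one inline link per supported extension. The helper name `findInlineAssetUrls` and the example.com URLs are illustrative assumptions, not part of the PR, and a real test would be written in the project's own test framework.

```javascript
// Copied from the diff above so the sketch is self-contained.
const nonImageAssetRegex = /(?:\[(.*?)\]|\[.*?\])\(([^)]+\.(?:pdf|doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp|rtf|txt|csv))\)|\[(.*?)\]:\s*(\S+\.(?:pdf|doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp|rtf|txt|csv))/gi;

// Hypothetical helper: collect the URLs of inline links (capture group 2).
const findInlineAssetUrls = (markdown) => [...markdown.matchAll(nonImageAssetRegex)]
  .map((match) => match[2])
  .filter(Boolean);

const extensions = ['pdf', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx', 'odt', 'ods', 'odp', 'rtf', 'txt', 'csv'];

// One inline link per supported extension, plus two links that must be ignored.
const expected = extensions.map((ext) => `https://example.com/files/sample.${ext}`);
const markdown = [
  ...expected.map((url) => `Download [the file](${url}) here.`),
  '[Home page](https://example.com/index.html)',
  '[Archive](https://example.com/files/sample.zip)',
].join('\n');

const found = findInlineAssetUrls(markdown);

// Fail loudly if a supported type is missed or an unsupported one slips in.
if (JSON.stringify(found) !== JSON.stringify(expected)) {
  throw new Error(`Unexpected asset urls: ${JSON.stringify(found)}`);
}

// The `i` flag makes the extension check case-insensitive.
const upperCase = findInlineAssetUrls('[Report](https://example.com/files/REPORT.PDF)');
```

Driving the cases from the extension list keeps the test in step with the regex: adding a new asset type means adding one entry to the array.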


/**
* Function to find reference definitions in a markdown file.
*
@@ -36,51 +39,60 @@ const findReferenceDefinitionsInMarkdown = (markdownContent) => {
};

/**
* Function to scan for images in a markdown file.
* Function to scan for assets in a markdown file.
*
* @param markdownContent - The content of the markdown file
* @returns {Array<string>} A Map of image urls as key
* @returns {Array<string>} An array of asset urls
*/
const findImagesInMarkdown = (markdownContent) => {
const findAssetsInMarkdown = (markdownContent) => {
const references = findReferenceDefinitionsInMarkdown(markdownContent);

const imageUrls = [];
const assetUrls = [];

// Identify each image url in the markdown content
let match;
let url;
// eslint-disable-next-line no-cond-assign
while ((match = imageRegex.exec(markdownContent)) !== null) {
let url;
if (match[2]) { // Inline image
// eslint-disable-next-line prefer-destructuring
url = match[2];
} else if (match[5]) { // Reference-style image
url = references[match[5]] || null; // Resolve URL from reference map
}
if (url) {
imageUrls.push(url);
assetUrls.push(url);
}
}

// Find and add only non-image asset links
// eslint-disable-next-line no-cond-assign
while ((match = nonImageAssetRegex.exec(markdownContent)) !== null) {
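// match[2] is the inline link url; match[4] is the reference-definition url
// (match[3] is the reference label, not a url)
url = match[2] || match[4];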
if (url) {
assetUrls.push(url);
}
}

return imageUrls;
return assetUrls;
};

/**
* Get the list image urls present in the markdown.
* Get the list of asset urls present in the markdown.
* @param {string} markdownContent - The content of the markdown file
* @returns {Array<string>} An array of image urls.
* @returns {Array<string>} An array of asset urls.
*/
const getImageUrlsFromMarkdown = (markdownContent) => {
const getAssetUrlsFromMarkdown = (markdownContent) => {
try {
return findImagesInMarkdown(markdownContent);
return findAssetsInMarkdown(markdownContent);
} catch (error) {
// eslint-disable-next-line no-console
console.warn('Error getting image urls from markdown:', error);
console.warn('Error getting asset urls from markdown:', error);
return [];
}
};

export {
// eslint-disable-next-line import/prefer-default-export
getImageUrlsFromMarkdown,
getAssetUrlsFromMarkdown,
};
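
A short sketch of how the capture groups of `nonImageAssetRegex` line up for the two link styles, which matters when choosing which group to read the URL from. In the pattern as written, group 2 holds the inline URL, group 3 the reference label, and group 4 the reference URL. The sample markdown and example.com URLs are illustrative assumptions.

```javascript
// Copied from the diff above so the sketch is self-contained.
const nonImageAssetRegex = /(?:\[(.*?)\]|\[.*?\])\(([^)]+\.(?:pdf|doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp|rtf|txt|csv))\)|\[(.*?)\]:\s*(\S+\.(?:pdf|doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp|rtf|txt|csv))/gi;

// Return the first match of the regex in a piece of markdown.
const firstMatch = (markdown) => [...markdown.matchAll(nonImageAssetRegex)][0];

const inline = firstMatch('[Report](https://example.com/files/report.pdf)');
const reference = firstMatch('[spec]: https://example.com/files/spec.pdf');

console.log(inline[2]);    // 'https://example.com/files/report.pdf'
console.log(reference[3]); // 'spec' (the label, not a url)
console.log(reference[4]); // 'https://example.com/files/spec.pdf'

// `doc` is listed before `docx` and nothing anchors the end of a reference
// url, so the longer extension is cut short in reference definitions.
const truncated = firstMatch('[spec]: https://example.com/files/spec.docx');
console.log(truncated[4]); // 'https://example.com/files/spec.doc'
```

Listing the longer extensions first (`docx` before `doc`) or anchoring the end of the reference URL would avoid the truncation; neither change is part of this PR.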
53 changes: 30 additions & 23 deletions src/package/packaging.js
@@ -24,6 +24,7 @@ import {
import { saveFile } from '../shared/filesystem.js';

let jcrPages = [];
const ASSET_MAPPING_FILE = 'asset-mappings.json';

const init = () => {
jcrPages = [];
@@ -36,13 +37,13 @@ const addPage = async (page, dir, prefix, zip) => {

/**
* Updates the asset references in given xml, to point to their respective JCR paths
* @param xml - The xml content of the page
* @param pageUrl - The url of the site page
* @param assetFolderName - The name of the asset folder in AEM
* @param imageMappings - A map to store the image urls and their corresponding jcr paths
* @param {string} xml - The xml content of the page
* @param {string} pageUrl - The url of the site page
* @param {string} assetFolderName - The name of the asset folder(s) in AEM
* @param {Map} assetMappings - A map to store the asset urls and their corresponding jcr paths
* @returns {Promise<*|string>} - The updated xml content
*/
export const updateAssetReferences = async (xml, pageUrl, assetFolderName, imageMappings) => {
export const updateAssetReferences = async (xml, pageUrl, assetFolderName, assetMappings) => {
let doc;
try {
doc = getParsedXml(xml);
@@ -53,14 +54,14 @@ export const updateAssetReferences = async (xml, pageUrl, assetFolderName, image
}

// Start traversal from the document root and update the asset references
traverseAndUpdateAssetReferences(doc.documentElement, pageUrl, assetFolderName, imageMappings);
traverseAndUpdateAssetReferences(doc.documentElement, pageUrl, assetFolderName, assetMappings);

const serializer = new XMLSerializer();
return serializer.serializeToString(doc);
};

// eslint-disable-next-line max-len
export const getJcrPages = async (pages, siteFolderName, assetFolderName, imageMappings) => Promise.all(pages.map(async (page) => ({
export const getJcrPages = async (pages, siteFolderName, assetFolderName, assetMappings) => Promise.all(pages.map(async (page) => ({
path: page.path,
sourceXml: page.data,
pageProperties: getPageProperties(page.data),
@@ -69,7 +70,7 @@ export const getJcrPages = async (pages, siteFolderName, assetFolderName, imageM
page.data,
page.url,
assetFolderName,
imageMappings,
assetMappings,
),
jcrPath: getJcrPagePath(page.path, siteFolderName),
contentXmlPath: `jcr_root${getJcrPagePath(page.path, siteFolderName)}/.content.xml`,
@@ -118,19 +119,32 @@ const getEmptyAncestorPages = (pages) => {
return emptyAncestors;
};

/**
* Save the asset mappings to a file.
* @param {Map} assetMappings - A map of asset urls and their corresponding jcr paths
* @param {*} outputDirectory - The directory handle
*/
const saveAssetMappings = async (assetMappings, outputDirectory) => {
// Convert Map to a plain object
const obj = Object.fromEntries(assetMappings);

// Save the updated asset mapping content into a file
await saveFile(outputDirectory, ASSET_MAPPING_FILE, JSON.stringify(obj, null, 2));
};

/**
* Creates a JCR content package from a directory containing pages.
* @param {*} outputDirectory - The directory handle
* @param {Array} pages - An array of pages
* @param {Array<string>} imageUrls - An array of image urls that were found in the markdown.
* @param {string} siteFolderName - The name of the site folder in AEM
* @param {string} assetFolderName - The name of the asset folder in AEM
* @param {Array<string>} assetUrls - An array of asset urls that were found in the markdown.
* @param {string} siteFolderName - The name of the site folder(s) in AEM
* @param {string} assetFolderName - The name of the asset folder(s) in AEM
* @returns {Promise} The file handle for the generated package.
*/
export const createJcrPackage = async (
outputDirectory,
pages,
imageUrls,
assetUrls,
siteFolderName,
assetFolderName,
) => {
@@ -143,14 +157,11 @@ export const createJcrPackage = async (
const zip = new JSZip();
const prefix = 'jcr';

const imageMappings = new Map();
// add the images as keys to the map
imageUrls.forEach((url) => {
imageMappings.set(url, '');
});
// create a map using the provided asset urls as keys (values will be populated later)
const assetMappings = new Map(assetUrls.map((url) => [url, '']));

// add the pages
jcrPages = await getJcrPages(pages, siteFolderName, assetFolderName, imageMappings);
jcrPages = await getJcrPages(pages, siteFolderName, assetFolderName, assetMappings);
for (let i = 0; i < jcrPages.length; i += 1) {
const page = jcrPages[i];
// eslint-disable-next-line no-await-in-loop
@@ -177,9 +188,5 @@ export const createJcrPackage = async (
await zip.generateAsync({ type: outputType })
.then(async (blob) => saveFile(outputDirectory, `${packageName}.zip`, blob));

// Convert Map to plain object
const obj = Object.fromEntries(imageMappings);

// Save the updated image mapping content into a file in the output directory
await saveFile(outputDirectory, 'image-mapping.json', JSON.stringify(obj, null, 2));
await saveAssetMappings(assetMappings, outputDirectory);
};
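
A minimal sketch of the data flow behind `asset-mappings.json`, assuming made-up asset URLs and a made-up JCR path: the map is seeded with empty values, some entries are filled in later, and the result is serialised the same way `saveAssetMappings` does it. The `set` call below is only a stand-in for the step that populates the map during page processing.

```javascript
// Hypothetical asset urls, standing in for the output of getAssetUrlsFromMarkdown.
const assetUrls = [
  'https://example.com/media_abc.png',
  'https://example.com/files/report.pdf',
];

// Every url starts out mapped to an empty string.
const assetMappings = new Map(assetUrls.map((url) => [url, '']));

// Stand-in for the later step that records the JCR path of a referenced asset.
assetMappings.set('https://example.com/files/report.pdf', '/content/dam/mysite/files/report.pdf');

// Same conversion as saveAssetMappings: Map -> plain object -> pretty-printed JSON.
const json = JSON.stringify(Object.fromEntries(assetMappings), null, 2);
console.log(json);
```

An asset that is never referenced in a page keeps its empty string in this sketch, so consumers of the file should be prepared for empty values.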
66 changes: 44 additions & 22 deletions src/package/packaging.utils.js
@@ -109,7 +109,7 @@ export const getFilterXml = (jcrPages) => {
* followed by the page name. If there are multiple pages, the package name
* will be the site folder name.
* @param {Array<Page>} pages the pages to be included in the package.
* @param {string} siteFolderName the name of the site folder in AEM.
* @param {string} siteFolderName the name of the site folder(s) in AEM.
* @returns {string} the package name.
*/
export const getPackageName = (pages, siteFolderName) => {
@@ -123,8 +123,8 @@ export const getPackageName = (pages, siteFolderName) => {
/**
* Get the JCR page path based on the site folder name and the path.
* @param {string} path the path of the page
* @param {string} siteFolderName the name of the site folder in AEM
* @returns the JCR page path
* @param {string} siteFolderName the name of the site folder(s) in AEM
* @returns {string} the JCR page path
*/
export const getJcrPagePath = (path, siteFolderName) => {
if (path.startsWith('/content/')) {
@@ -141,36 +141,58 @@ export const getJcrPagePath = (path, siteFolderName) => {

/**
* Get the JCR path for an asset.
* NOTE: The returned path is lower-cased. DAM paths in AEM are
* case-sensitive, and AEM automatically generates lowercase JCR node
* names, so reference paths must be lowercase as well.
* @param {URL} assetUrl - The URL of the asset
* @param {string} assetFolderName - The name of the asset folder in AEM
* @returns the JCR path for the asset
* @param {string} assetFolderName - The name of the asset folder(s) in AEM
* @returns {string} the JCR path for the asset.
*/
const getJcrAssetPath = (assetUrl, assetFolderName) => {
const extension = (assetUrl.pathname.includes('.')) ? `.${assetUrl.pathname.split('.').pop()}` : '';
let path = assetUrl.pathname.replace(extension, '');
export const getJcrAssetPath = (assetUrl, assetFolderName) => {
let path = assetUrl.pathname;
let jcrAssetPath;
// Extract file extension (only the last part)
const lastDotIndex = path.lastIndexOf('.');
let extension = '';

// if there is a valid extension, remove it from the path
if (lastDotIndex !== -1 && lastDotIndex > path.lastIndexOf('/')) {
extension = path.substring(lastDotIndex);
// Remove only the last extension from path
path = path.substring(0, lastDotIndex);
}

if (path.startsWith('/content/dam/')) {
// replace the 3rd token with the asset folder name
const tokens = path.split('/');
const assetFolderTokens = assetFolderName.split('/');

// Find and remove existing occurrence of assetFolderName
for (let i = 3; i <= tokens.length - assetFolderTokens.length; i += 1) {
if (tokens.slice(i, i + assetFolderTokens.length).join('/') === assetFolderName) {
tokens.splice(i, assetFolderTokens.length);
break;
}
}

// insert the assetFolderName in index position 3 ("", /content, /dam)
// and move everything after over resulting in /content/dam/<site>/<asset_path>
tokens.splice(3, 0, assetFolderName);
return `${tokens.join('/')}${extension}`;
}
tokens.splice(3, 0, ...assetFolderTokens);

const suffix = '';
// replace media_ with media1_ in path to avoid conflicts with the media folder
path = path.replace('/media_', '/media1_');

return `/content/dam/${assetFolderName}${path}${suffix}${extension}`;
jcrAssetPath = `${tokens.join('/')}${extension}`;
} else {
// replace media_ with media1_ in path to avoid conflicts with the media folder
path = path.replace('/media_', '/media1_');
jcrAssetPath = `/content/dam/${assetFolderName}${path}${extension}`;
}
return jcrAssetPath.toLowerCase();
};
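
A worked example of the new path logic, using a lightly condensed reconstruction of the added lines in this hunk. The input URLs and folder names (`mysite`, `mysite/en`) are illustrative assumptions. The three cases cover a non-DAM URL, a DAM URL with a multi-segment asset folder, and a DAM URL that already contains the asset folder.

```javascript
// Reconstructed from the added lines of the hunk above.
const getJcrAssetPath = (assetUrl, assetFolderName) => {
  let path = assetUrl.pathname;
  let jcrAssetPath;
  let extension = '';

  // Split off only the last extension, and only if the dot is in the file name.
  const lastDotIndex = path.lastIndexOf('.');
  if (lastDotIndex !== -1 && lastDotIndex > path.lastIndexOf('/')) {
    extension = path.substring(lastDotIndex);
    path = path.substring(0, lastDotIndex);
  }

  if (path.startsWith('/content/dam/')) {
    const tokens = path.split('/');
    const assetFolderTokens = assetFolderName.split('/');

    // Drop an existing occurrence of the asset folder so it is not duplicated.
    for (let i = 3; i <= tokens.length - assetFolderTokens.length; i += 1) {
      if (tokens.slice(i, i + assetFolderTokens.length).join('/') === assetFolderName) {
        tokens.splice(i, assetFolderTokens.length);
        break;
      }
    }

    // Insert the asset folder segments right after /content/dam.
    tokens.splice(3, 0, ...assetFolderTokens);
    jcrAssetPath = `${tokens.join('/')}${extension}`;
  } else {
    // Rename media_ to media1_ to avoid conflicts with the media folder.
    path = path.replace('/media_', '/media1_');
    jcrAssetPath = `/content/dam/${assetFolderName}${path}${extension}`;
  }
  return jcrAssetPath.toLowerCase();
};

const external = getJcrAssetPath(new URL('https://example.com/assets/media_18c9.png'), 'mysite');
const nested = getJcrAssetPath(new URL('https://example.com/content/dam/docs/Report.v2.PDF'), 'mysite/en');
const existing = getJcrAssetPath(new URL('https://example.com/content/dam/mysite/images/logo.png'), 'mysite');

console.log(external); // '/content/dam/mysite/assets/media1_18c9.png'
console.log(nested);   // '/content/dam/mysite/en/docs/report.v2.pdf'
console.log(existing); // '/content/dam/mysite/images/logo.png'
```

The second case shows why the extension is taken from the last dot only: `Report.v2.PDF` keeps `.v2` as part of the file name.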

/**
* Get the JCR path for an asset reference.
* @param {string} assetReference the asset reference
* @param {string} pageUrl the URL of the page
* @param {string} assetFolderName the name of the asset folder in AEM
* @returns the JCR path for the file reference
* @param {string} assetFolderName the name of the asset folder(s) in AEM
* @returns {string} the JCR path for the file reference
*/
const getJcrAssetRef = (assetReference, pageUrl, assetFolderName) => {
const host = new URL(pageUrl).origin;
@@ -219,7 +241,7 @@ export function getFullAssetUrl(assetReference, pageUrl) {

// If the asset reference starts with './', it is a relative file path
if (assetReference.startsWith('./')) {
return new URL(assetReference, pageUrlObj.href).pathname;
return new URL(assetReference, pageUrlObj.href).href;
}

// Absolute asset reference, appending the asset path to the host
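
The one-line change in this hunk swaps `.pathname` for `.href` on the resolved URL. A quick sketch of the difference, using the standard `URL` class with a made-up page URL and asset reference:

```javascript
// Hypothetical page url and relative asset reference.
const pageUrl = 'https://example.com/blog/post';
const resolved = new URL('./files/report.pdf', pageUrl);

// Old behaviour: only the path, the host is lost.
console.log(resolved.pathname); // '/blog/files/report.pdf'

// New behaviour: the full, absolute asset url.
console.log(resolved.href); // 'https://example.com/blog/files/report.pdf'
```

Returning the full URL keeps relative references in the same form as the absolute asset URLs seen elsewhere in the diff, for example in the test fixtures.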
@@ -243,14 +265,14 @@ function updateJcrAssetMap(jcrAssetMap, originalPath, updatedAssetPath, pageUrl)
* Traverse the DOM tree and update the asset references to point to the JCR paths.
* @param {*} node - The node to traverse
* @param {string} pageUrl - The URL of the page
* @param {string} assetFolderName - The name of the asset folder in AEM
* @param {string} assetFolderName - The name of the asset folder(s) in AEM
* @param {Map} jcrAssetMap - A map of asset references to their corresponding JCR paths
*/
export const traverseAndUpdateAssetReferences = (node, pageUrl, assetFolderName, jcrAssetMap) => {
if (node.nodeType === 1) { // Element node
// eslint-disable-next-line no-restricted-syntax
for (const attr of node.attributes) {
// Unescape HTML entities (needs double decoding as image urls are double encoded in the xml)
// Unescape HTML entities (needs double decoding as asset urls are double encoded in the xml)
// console.log(`Checking attribute: ${attr.name}`);
let attrValue = he.decode(he.decode(node.getAttribute(attr.name)));
const keys = [...jcrAssetMap.keys()];
9 changes: 9 additions & 0 deletions test/fixtures/mystique/hero.md
@@ -0,0 +1,9 @@
+---------------------------------------------+
| Hero |
+=============================================+
| ![][image0] |
+---------------------------------------------+
| # Say Hello to Effortless Webpage Creation! |
+---------------------------------------------+

[image0]: https://experience-platform-mystique-deploy-ethos102-stage-88229c.stage.cloud.adobe.io/proxy-4b739f7f3d2b43009055c893cdf99ba8-4f279b4c398f4d1d98d4552e3cf521ca/assets/media_18c9c39f49c4050fb54d50b03085228a12a9b1666.png "Effortless Webpage Creation with Mystique's AEM Crosswalk"
10 changes: 10 additions & 0 deletions test/fixtures/mystique/hero.xml
@@ -0,0 +1,10 @@
<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0" xmlns:cq="http://www.day.com/jcr/cq/1.0" xmlns:sling="http://sling.apache.org/jcr/sling/1.0" jcr:primaryType="cq:Page">
<jcr:content cq:template="/libs/core/franklin/templates/page" sling:resourceType="core/franklin/components/page/v1/page" jcr:primaryType="cq:PageContent">
<root jcr:primaryType="nt:unstructured" sling:resourceType="core/franklin/components/root/v1/root">
<section sling:resourceType="core/franklin/components/section/v1/section" jcr:primaryType="nt:unstructured">
<block sling:resourceType="core/franklin/components/block/v1/block" jcr:primaryType="nt:unstructured" image="https://experience-platform-mystique-deploy-ethos102-stage-88229c.stage.cloud.adobe.io/proxy-4b739f7f3d2b43009055c893cdf99ba8-4f279b4c398f4d1d98d4552e3cf521ca/assets/media_18c9c39f49c4050fb54d50b03085228a12a9b1666.png" model="hero" modelFields="[image,imageAlt,text]" name="Hero" text="&lt;p&gt;&lt;h1&gt;Say Hello to Effortless Webpage Creation!&lt;/h1&gt;&lt;/p&gt;"></block>
</section>
</root>
</jcr:content>
</jcr:root>