feat(import): multiple docx output files
kptdobe authored Jul 6, 2022
1 parent 780d0a3 commit fa4610b
Showing 10 changed files with 2,246 additions and 2,243 deletions.
7 changes: 5 additions & 2 deletions README.md
@@ -20,9 +20,12 @@ In the `URL(s)` field, give a list of page URLs to be imported (e.g. {https://ww

### Transformation file

A default HTML to Markdown transformation is applied, but you can (or need to) provide your own. Initially the import transformation file is fetched at http://localhost:3001/tools/importer/import.js (can be changed in the options). Create the file using the following template:
A default HTML to Markdown transformation is applied, but you can (or need to) provide your own. Initially the import transformation file is fetched at http://localhost:3001/tools/importer/import.js (can be changed in the options). Create the file using the following templates:

https://gist.github.com/kptdobe/8a726387ecca80dde2081b17b3e913f7
- if you need to create a single md/docx file from each input page, you can use this template: https://gist.github.com/kptdobe/8a726387ecca80dde2081b17b3e913f7
- if you need to create multiple md/docx files from each input page, you must use this template: https://gist.github.com/kptdobe/7bf50b69194884171b12874fc5c74588

Note that the 2 templates currently do exactly the same thing, but the second one uses the `transform` method, whose returned array can contain more than one element. See the guidelines for an example.

### Guidelines

8 changes: 8 additions & 0 deletions css/import/import.css
@@ -160,3 +160,11 @@
.import #import-markdown-preview td {
padding: 0 6px;
}

.import #import-file-picker-container {
width: 100%;
}

.import #import-file-picker-container sp-picker {
width: 100%;
}
1 change: 1 addition & 0 deletions import.html
@@ -80,6 +80,7 @@ <h3>Page preview</h3>
<sp-tab label="Preview" value="import-preview"></sp-tab>
<sp-tab label="Markdown" value="import-markdown"></sp-tab>
<sp-tab label="HTML" value="import-html"></sp-tab>
<div id="import-file-picker-container"></div>
<sp-tab-panel value="import-preview">
<sp-theme color="light" scale="medium">
<div id="import-markdown-preview"></div>
86 changes: 83 additions & 3 deletions importer-guidelines.md
@@ -16,15 +16,33 @@ Out of the box, the importer should be able to consume any page and output a Mar

Such a rule is very straightforward to implement: it is usually a set of DOM operations that create, move or delete DOM elements.

In your `import.js` transformation file, you can implement 2 methods:
In your `import.js` transformation file, you can implement 2 modes:
- one input / one output
- one input / multiple outputs

- `transformDOM: ({ document, url, html }) => {}`: implement here your transformation rules and return the DOM element that needs to be transformed to Markdown (default is `document.body` but usually a `main` element is more relevant).
#### one input / one output

You must implement these 2 methods:

- `transformDOM: ({ document, url, html, params }) => {}`: implement your transformation rules here and return the DOM element that needs to be transformed to Markdown (default is `document.body` but usually a `main` element is more relevant).
- `document`: the incoming DOM
- `url`: the current URL being imported
- `html`: the original HTML source (when loading the DOM as a document, some things are cleaned up, having the raw original HTML is sometimes useful)
- `generateDocumentPath: ({ document, url }) => {}`: return a path that describes the document being transformed - allows you to define / filter the page name and the folder structure in which the document should be stored (default is the current url pathname with the trailing slash and the `.html`)
- `params`: parameters provided by the importer. The only one so far is `originalURL`, the URL of the page being imported (`url` is the proxied URL)
- `generateDocumentPath: ({ document, url, html, params }) => {}`: return a path that describes the document being transformed - allows you to define / filter the page name and the folder structure in which the document should be stored (default is the current url pathname with the trailing slash and the `.html`). The parameters are the same as above.

This is the simpler version of the implementation. You can achieve the same result by implementing the `transform` method as described below.
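
As a rough illustration of the one input / one output mode, the sketch below wires the two methods together. The `importRules` name, the fallback to `document.body` and the path clean-up are assumptions made for this example, not the content of the template gist; in a real `import.js` the object would be exposed to the importer rather than kept in a local constant.

```js
// Hypothetical one input / one output rules (names are illustrative only).
const importRules = {
  // Return the element to convert; fall back to the body when there is no <main>.
  transformDOM: ({ document }) => document.querySelector('main') || document.body,

  // Assumed project choice: drop a trailing slash and a `.html` extension.
  generateDocumentPath: ({ url }) => new URL(url).pathname
    .replace(/\/$/, '')
    .replace(/\.html$/, ''),
};
```

Both methods receive the same argument object, so each one only destructures the fields it needs.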

#### one input / multiple outputs

You must implement this method:
- `transform: ({ document, url, html, params }) => {}`: implement your transformation rules here and return an array of pairs `{ element, path }` where `element` is a DOM element that needs to be transformed to Markdown and `path` is the path to the exported file.
- `document`: the incoming DOM
- `url`: the current URL being imported
- `html`: the original HTML source (when loading the DOM as a document, some things are cleaned up, having the raw original HTML is sometimes useful)
- `params`: parameters provided by the importer. The only one so far is `originalURL`, the URL of the page being imported (`url` is the proxied URL)

The idea is simple: return a list of elements, each of which will be converted to docx and stored at its `path` location.
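
On the importer side, the `pollimporter.js` change in this commit wraps a single result in an array and, for docx output, derives a `.docx` filename from each `path`. The standalone sketch below mirrors that step; `toResults` is a hypothetical helper name, the sample objects are made up, and unlike the original code it copies each result instead of mutating it.

```js
// Hypothetical helper mirroring the normalization done in pollimporter.js.
const toResults = (out) => {
  // A single-output transformation yields one object; wrap it so callers always get an array.
  const results = Array.isArray(out) ? out : [out];
  // Each output gets a filename built from its path plus the .docx extension.
  return results.map((result) => ({ ...result, filename: `${result.path}.docx` }));
};

// Works the same for one output or several.
console.log(toResults({ path: '/main' }).map((r) => r.filename));
console.log(toResults([{ path: '/main' }, { path: '/image' }]).map((r) => r.filename));
```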

## Rule examples

@@ -241,6 +259,68 @@ Output is then:
# Hello World
![](https://www.sample.com/images/helloworld.png);
```

### Multiple outputs

If you need to transform one page into multiple Word documents (fragments, banners, author pages...), you can use the `transform` method.

Input DOM:

```html
<html>
<head></head>
<body>
<main>
<h1>Hello World</h1>
<div class="hero" style="background-image: url(https://www.sample.com/images/helloworld.png);"></div>
</main>
</body>
</html>
```

With the following `import.js`, you will get 2 md / docx documents:

```js
{
transform: ({ document, params }) => {
const main = document.querySelector('main');
// keep a reference to the image
const image = main.querySelector('.hero');

// remove the image from main, otherwise we'll get it in both documents
WebImporter.DOMUtils.remove(main, [
'.hero',
]);

return [{
element: main,
path: '/main',
}, {
element: image,
path: '/image',
}];
},
}
```

Outputs are:

`/main.md`

```md
# Hello World
```

`/image.md`

```md
![](https://www.sample.com/images/helloworld.png);
```

Note:
- be careful with the DOM elements you are working with. You always work on the same document, so destroying elements for one output may have an impact on the other outputs.
- you may have as many outputs as you want (limit not tested yet).
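
One way to limit such side effects is to deep-copy an element before modifying the tree it came from. The sketch below is only an illustration of that idea using the standard DOM `cloneNode` and `remove` methods; whether the importer treats a cloned element exactly like the original is an assumption that has not been verified here.

```js
// Hypothetical transform: the second output works on its own copy of the hero.
const transform = ({ document }) => {
  const main = document.querySelector('main');
  const hero = main.querySelector('.hero');

  // Deep-copy the hero so later changes to main cannot alter the second output.
  const heroCopy = hero.cloneNode(true);

  // Remove the original from main so it does not appear in both documents.
  hero.remove();

  return [
    { element: main, path: '/main' },
    { element: heroCopy, path: '/image' },
  ];
};
```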

### More samples

Sites in the https://github.com/hlxsites/ organisation have all been imported. Their many different implementations cover a lot of use cases.
104 changes: 73 additions & 31 deletions js/import/import.ui.js
@@ -12,6 +12,7 @@
/* global CodeMirror, showdown, html_beautify, ExcelJS */
import { initOptionFields, attachOptionFieldsListeners } from '../shared/fields.js';
import { getDirectoryHandle, saveFile } from '../shared/filesystem.js';
import { asyncForEach } from '../shared/utils.js';
import PollImporter from '../shared/pollimporter.js';
import alert from '../shared/alert.js';

@@ -38,6 +39,8 @@ const IS_BULK = document.querySelector('.import-bulk') !== null;
const BULK_URLS_HEADING = document.querySelector('#import-result h2');
const BULK_URLS_LIST = document.querySelector('#import-result ul');

const IMPORT_FILE_PICKER_CONTAINER = document.getElementById('import-file-picker-container');

const ui = {};
const config = {};
const importStatus = {
@@ -68,20 +71,48 @@ const setupUI = () => {
ui.markdownPreview.innerHTML = ui.showdownConverter.makeHtml('Run an import to see some markdown.');
};

const updateImporterUI = (out) => {
const { md, html: outputHTML, originalURL } = out;
const loadResult = ({ md, html: outputHTML }) => {
ui.transformedEditor.setValue(html_beautify(outputHTML));
ui.markdownEditor.setValue(md || '');

const mdPreview = ui.showdownConverter.makeHtml(md);
ui.markdownPreview.innerHTML = mdPreview;

// remove existing classes and styles
Array.from(ui.markdownPreview.querySelectorAll('[class], [style]')).forEach((t) => {
t.removeAttribute('class');
t.removeAttribute('style');
});
};

const updateImporterUI = (results, originalURL) => {
if (!IS_BULK) {
ui.transformedEditor.setValue(html_beautify(outputHTML));
ui.markdownEditor.setValue(md || '');
IMPORT_FILE_PICKER_CONTAINER.innerHTML = '';
const picker = document.createElement('sp-picker');
picker.setAttribute('size', 'm');

results.forEach((result, index) => {
const { path } = result;

// add result to picker list
const item = document.createElement('sp-menu-item');
item.innerHTML = path;
if (index === 0) {
item.setAttribute('selected', true);
picker.setAttribute('label', path);
picker.setAttribute('value', path);
}
picker.appendChild(item);
});

const mdPreview = ui.showdownConverter.makeHtml(md);
ui.markdownPreview.innerHTML = mdPreview;
IMPORT_FILE_PICKER_CONTAINER.append(picker);

// remove existing classes and styles
Array.from(ui.markdownPreview.querySelectorAll('[class], [style]')).forEach((t) => {
t.removeAttribute('class');
t.removeAttribute('style');
picker.addEventListener('change', (e) => {
const r = results.find((i) => i.path === e.target.value);
loadResult(r);
});

loadResult(results[0]);
} else {
const li = document.createElement('li');
const link = document.createElement('sp-link');
@@ -101,6 +132,12 @@ const clearResultPanel = () => {
BULK_URLS_HEADING.innerText = 'Importing...';
};

const clearImportStatus = () => {
importStatus.imported = 0;
importStatus.total = 0;
importStatus.rows = [];
};

const disableProcessButtons = () => {
IMPORT_BUTTON.disabled = true;
};
@@ -127,6 +164,23 @@ const getProxyURLSetup = (url, origin) => {
};
};

const postImportProcess = async (results, originalURL) => {
await asyncForEach(results, async ({ docx, filename, path }) => {
const data = {
status: 'Success',
url: originalURL,
path,
};

const includeDocx = !!docx;
if (includeDocx) {
await saveFile(dirHandle, filename, docx);
data.docx = filename;
}
importStatus.rows.push(data);
});
};

const createImporter = () => {
config.importer = new PollImporter({
origin: config.origin,
@@ -140,25 +194,14 @@ const getContentFrame = () => document.querySelector(`${PARENT_SELECTOR} iframe`
const attachListeners = () => {
attachOptionFieldsListeners(config.fields, PARENT_SELECTOR);

config.importer.addListener(async (out) => {
config.importer.addListener(async ({ results }) => {
const frame = getContentFrame();
out.originalURL = frame.dataset.originalURL;
const includeDocx = !!out.docx;
const { originalURL } = frame.dataset;

updateImporterUI(out, includeDocx);
updateImporterUI(results, originalURL);
await postImportProcess(results, originalURL);

const data = {
status: 'Success',
url: out.originalURL,
path: out.path,
};
if (includeDocx) {
const { docx, filename } = out;
await saveFile(dirHandle, filename, docx);
data.docx = filename;
}
importStatus.rows.push(data);
alert.success(`Import of page ${frame.dataset.originalURL} completed.`);
alert.success(`Import of page ${originalURL} completed.`);
});

config.importer.addErrorListener(({ url, error: err }) => {
@@ -168,6 +211,8 @@ const attachListeners = () => {
});

IMPORT_BUTTON.addEventListener('click', (async () => {
clearImportStatus();

if (IS_BULK) {
clearResultPanel();
if (config.fields['import-show-preview']) {
@@ -196,9 +241,6 @@ const attachListeners = () => {
}
}

importStatus.imported = 0;
importStatus.rows = [];

const field = IS_BULK ? 'import-urls' : 'import-url';
const urlsArray = config.fields[field].split('\n').reverse().filter((u) => u.trim() !== '');
importStatus.total = urlsArray.length;
@@ -242,14 +284,14 @@ const attachListeners = () => {
const includeDocx = !!dirHandle;

window.setTimeout(async () => {
const { originalURL } = frame.dataset;
const { replacedURL } = frame.dataset;
const { originalURL, replacedURL } = frame.dataset;
if (frame.contentDocument) {
try {
config.importer.setTransformationInput({
url: replacedURL,
document: frame.contentDocument,
includeDocx,
params: { originalURL },
});
await config.importer.transform();
} catch (e) {
44 changes: 30 additions & 14 deletions js/shared/pollimporter.js
@@ -73,46 +73,62 @@ export default class PollImporter {
}

async transform() {
const {
includeDocx, url, document, params,
} = this.transformation;

try {
let out;
if (this.transformation.includeDocx) {
out = await WebImporter.html2docx(
this.transformation.url,
this.transformation.document,
let results;
if (includeDocx) {
const out = await WebImporter.html2docx(
url,
document,
this.projectTransform,
params,
);

const { path } = out;
out.filename = `${path}.docx`;
results = Array.isArray(out) ? out : [out];
results.forEach((result) => {
const { path } = result;
result.filename = `${path}.docx`;
});
} else {
out = await WebImporter.html2md(
this.transformation.url,
this.transformation.document,
const out = await WebImporter.html2md(
url,
document,
this.projectTransform,
params,
);
results = Array.isArray(out) ? out : [out];
}

this.listeners.forEach((listener) => {
listener({
...out,
url: this.transformation.url,
results,
url,
});
});
} catch (err) {
this.errorListeners.forEach((listener) => {
listener({
url: this.transformation.url,
url,
error: err,
});
});
}
}

setTransformationInput({ url, document, includeDocx = false }) {
setTransformationInput({
url,
document,
includeDocx = false,
params,
}) {
this.transformation = {
url,
document,
includeDocx,
params,
};
}
