Parse Markdown (Wikipedia/Entity Linking) #8

Open
adibaba opened this issue Dec 12, 2022 · 0 comments
Labels
enhancement New feature or request

adibaba (Owner) commented Dec 12, 2022

Wikipedia markdown could be parsed appropriately.
Candidate: https://github.com/vsch/flexmark-java,
which is a fork of https://github.com/commonmark/commonmark-java

sirthias/pegdown, which is built on sirthias/parboiled, is deprecated and suggests vsch/flexmark-java instead.

Example without table parsing, but with LinkRef extraction:
https://en.wikipedia.org/w/api.php?action=parse&prop=wikitext&format=json&oldid=1126125069

package org.dice_research.launuts.linking;

import java.io.File;

import org.dice_research.launuts.Config;
import org.dice_research.launuts.io.Io;
import org.jetbrains.annotations.NotNull;
import org.json.JSONObject;

import com.vladsch.flexmark.parser.Parser;
import com.vladsch.flexmark.util.ast.Node;
import com.vladsch.flexmark.util.ast.NodeVisitor;
import com.vladsch.flexmark.util.ast.TextCollectingVisitor;
import com.vladsch.flexmark.util.ast.VisitHandler;
import com.vladsch.flexmark.util.ast.Visitor;

public class WikipediaLinking {

	public static final String PREFIX_WP_OLDID = "https://en.wikipedia.org/w/api.php?action=parse&prop=wikitext&format=json&oldid=";
	public static final String WP_NUTS1EU_OLDID = "1126125069";
	public static final String WP_NUTS1EU_FILENAME = "NUTS-1-EU.json";

	public static File getWpNuts1euFile() {
		return new File(Config.get(Config.KEY_DOWNLOAD_DIRECTORY), WP_NUTS1EU_FILENAME);
	}

	/**
	 * Downloads NUTS 1 sources from 17:45, 7 December 2022.
	 * 
	 * @see https://en.wikipedia.org/w/index.php?title=First-level_NUTS_of_the_European_Union&oldid=1126125069
	 * @see https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=First-level_NUTS_of_the_European_Union&rvslots=*&rvprop=content
	 */
	public WikipediaLinking downloadWpNuts1Eu() {
		Io.download(PREFIX_WP_OLDID + WP_NUTS1EU_OLDID, getWpNuts1euFile(), false);
		return this;
	}

	private String getNutsWikisource() {
		return new JSONObject(Io.readFileToString(getWpNuts1euFile())).getJSONObject("parse").getJSONObject("wikitext")
				.getString("*");
	}

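	public static void main(String[] args) {
		// Note: the API returns wikitext, which is handed to the Markdown parser unchanged.
		String markdown = new WikipediaLinking().downloadWpNuts1Eu().getNutsWikisource();

		// Debug switch: change to Boolean.TRUE to print the raw page source.
		if (Boolean.FALSE) {
			System.out.println(markdown);
		}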

		// https://github.com/vsch/flexmark-java/blob/0.64.0/flexmark-java-samples/src/com/vladsch/flexmark/java/samples/BasicSample.java
		Parser parser = Parser.builder().build();
		Node document = parser.parse(markdown);

		if (Boolean.FALSE) {
			VisitorSmpl visitorSmpl = new WikipediaLinking().new VisitorSmpl();
			visitorSmpl.visit(document);
			System.out.println(visitorSmpl.getText());
		}

		if (Boolean.FALSE) {
			TextCollectingVisitor textCollectingVisitor = new TextCollectingVisitor();
			System.out.println(textCollectingVisitor.collectAndGetText(document));
		}

	}

	/**
	 * Usage: VisitorSmpl visitorSmpl = new WikipediaLinking().new VisitorSmpl();
	 * visitorSmpl.visit(document); System.out.println(visitorSmpl.getText());
	 * 
	 * @see https://github.com/vsch/flexmark-java/blob/0.64.0/flexmark-java-samples/src/com/vladsch/flexmark/java/samples/VisitorSample.java
	 */
	public class VisitorSmpl implements Visitor<Node> {
		NodeVisitor visitor = new NodeVisitor(new VisitHandler<>(Node.class, this::visit));
		StringBuilder sb = new StringBuilder();

		@Override
		public void visit(@NotNull Node node) {
			sb.append(node.getChars().unescape());
			visitor.visitChildren(node);
		}

		public String getText() {
			return sb.toString();
		}
	}

}
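
The API response above contains wikitext, where links are written as `[[Target]]` or `[[Target|label]]` rather than as Markdown links, so a plain CommonMark-style parse is unlikely to surface them as link nodes. As a rough sketch of how link targets could be collected for entity linking, the following uses only `java.util.regex`. The class name `WikiLinkSketch`, the method `extractLinks`, and the sample string are hypothetical and not part of the project; nested links, templates, and tables are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the launuts code base.
class WikiLinkSketch {

	// Matches [[Target]] and [[Target|label]].
	// Group 1 is the link target, group 2 the optional label after the first pipe.
	private static final Pattern WIKILINK = Pattern
			.compile("\\[\\[([^\\[\\]|]+)(?:\\|([^\\[\\]]*))?\\]\\]");

	/**
	 * Returns one {target, label} pair per wikilink, in order of appearance.
	 * If a link has no label, the target is used as the label.
	 */
	static List<String[]> extractLinks(String wikitext) {
		List<String[]> links = new ArrayList<>();
		Matcher matcher = WIKILINK.matcher(wikitext);
		while (matcher.find()) {
			String target = matcher.group(1).trim();
			String label = matcher.group(2) == null ? target : matcher.group(2).trim();
			links.add(new String[] { target, label });
		}
		return links;
	}

	public static void main(String[] args) {
		// Assumed sample row; real input would come from getNutsWikisource().
		String sample = "| DE2 || [[Bavaria|Bayern]] || [[Munich]]";
		for (String[] link : extractLinks(sample)) {
			System.out.println(link[0] + " -> " + link[1]);
		}
	}
}
```

The targets collected this way could then be mapped to Wikipedia page titles. Whether a regular expression is sufficient, or a dedicated wikitext parser is the better fit, is left open here.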

Also see #12 and #13

@adibaba adibaba added the enhancement New feature or request label Dec 13, 2022
@adibaba adibaba changed the title Parse Markdown Parse Markdown (Wikipedia/Entity Linking) Dec 13, 2022