Convert FlexT descriptions into Kaitai ones #292

Open
KOLANICH opened this issue Nov 15, 2017 · 8 comments

@KOLANICH

KOLANICH commented Nov 15, 2017

I've found a webpage describing one more language for creating binary parsers. It also has its own library of descriptions. Some of them are missing from kaitai_struct_formats, for example SQLite.
http://geos0.icc.ru/scripts/WWWBinV.dll/Cat

In an e-mail conversation, Alexei Hmelnov stated that one may assume that all the descriptions are under the MIT license and that he is going to add a license to the site when he has time.

The script to download the formats from the site:
#!/usr/bin/env python3
import encodings
import json  # don't optionally replace with ujson, serializes differently!
import os
from collections import OrderedDict
from pathlib import Path

import bs4
import dateutil.parser
import ratelimit  # https://github.com/tomasbasham/ratelimit
from tqdm import tqdm

try:
	import httpx
except ImportError:
	import requests as httpx

base = "http://geos0.icc.ru"
catalog = base + "/scripts/WWWBinV.dll/Cat"


def getCharset(soup: bs4.BeautifulSoup, default: str = "utf-8") -> str:
	el = soup.select_one("head > meta[http-equiv=Content-Type]")
	if el:
		for part in el["content"].split(";"):
			if "=" in part:
				part = part.split("=")
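				# lower must be called; comparing the bound method itself to a string is always False
				if len(part) > 1 and part[0].strip().lower() == "charset":
					return part[1].strip()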
	return default


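def bin2soup(data: (bytes, bytearray)) -> bs4.BeautifulSoup:
	# First pass: decode with the expected encoding just to be able to read the <meta> charset
	enc = "windows-1251"
	text = data.decode(encoding=enc, errors="replace")
	soup = bs4.BeautifulSoup(text, "html5lib")
	enc1 = getCharset(soup, "windows-1251")
	if enc != enc1:
		# The page declares another charset: decode and parse again
		text = data.decode(encoding=enc1)
		soup = bs4.BeautifulSoup(text, "html5lib")
	return soup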


def buildIndex(targetDir: Path) -> OrderedDict:
	catalogCacheFile = targetDir / "Cat"
	if not catalogCacheFile.is_file():
		catalogTextEncoded = httpx.get(catalog).content
		catalogCacheFile.write_bytes(catalogTextEncoded)
	else:
		print("Index source is already present. Delete " + str(catalogCacheFile) + " to regenerate")
		catalogTextEncoded = catalogCacheFile.read_bytes()

	parsed = bin2soup(catalogTextEncoded)
	table = parsed.select_one("table")
	rows = table.select("tr")
	header = [el.text.strip().lower() for el in rows[0].select("td")]

	rows = rows[1:]
	rowsRes = OrderedDict()
	for row in rows:
		rowRes = OrderedDict(zip(header, [el.text.strip() for el in row.select("td")]))
		# rowRes["date"]=dateutil.parser.parse(rowRes["date"])
		rowRes["uri"] = base + row.select_one("a[href]")["href"]

		cat = rowsRes
		if rowRes["class"] not in cat:
			cat[rowRes["class"]] = {}
		cat = cat[rowRes["class"]]
		if rowRes["status"] not in cat:
			cat[rowRes["status"]] = {}
		cat = cat[rowRes["status"]]

		cat[rowRes["file"]] = rowRes
	return rowsRes


def writeSource(soup: bs4.BeautifulSoup, fileName: Path) -> None:
	meta = {}
	for el in soup.select("head > meta"):
		if "name" in el.attrs:
			meta[el.attrs["name"]] = el.attrs["content"]
	metaStr = soup.select_one("font").text
	source = soup.select_one("pre").text

	source = "% " + metaStr + "\n" + "% " + json.dumps(meta) + "\n\n" + source

	with fileName.open("wt", encoding="utf-8") as f:
		f.write(source)


@ratelimit.rate_limited(period=2)
def downloadFormat(uri: str, path: Path) -> None:
	req = httpx.get(uri)
	soup = bin2soup(req.content)
	writeSource(soup, path)


def downloadFormats(index: OrderedDict, targetDir: Path) -> None:
	for clsName, cls in tqdm(index.items(), desc="Classes"):
		clsPath = targetDir / clsName.replace(".", "").replace("/", "").replace("\\", "")
		for statusName, status in tqdm(cls.items(), desc="Statuses in " + clsName):
			statusPath = clsPath / statusName.replace(".", "").replace("/", "").replace("\\", "")
			os.makedirs(str(statusPath), mode=0o771, exist_ok=True)
			for formatName, formatDescr in tqdm(status.items(), desc="Formats in " + statusName):
				formatPath = statusPath / formatName.replace("/", "").replace("\\", "")
				tqdm.write("downloading : " + formatDescr["uri"] + " -> " + str(formatPath))
				downloadFormat(formatDescr["uri"], formatPath)


def main() -> None:
	targetDir = Path(".")
	indexCacheFile = targetDir / "cat.json"
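	if not indexCacheFile.is_file():
		# Build the index before opening the cache file, so a failed download does not leave an empty cat.json behind
		index = buildIndex(targetDir)
		with indexCacheFile.open("wt", encoding="utf-8") as f:
			json.dump(index, f, indent="\t")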
	else:
		print("Index is already present. Delete " + str(indexCacheFile) + " to regenerate")
		with indexCacheFile.open("rt", encoding="utf-8") as f:
			index = json.load(f)
	downloadFormats(index, targetDir)


if __name__ == "__main__":
	main()

He has also provided me with a formal grammar of his language.
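The charset lookup that getCharset performs on the Content-Type meta value can be tried in isolation. The following is only a minimal sketch using the standard library; the function name charset_from_content_type is hypothetical and not part of the script, and the sample strings are assumptions about what the site's pages declare.

```python
def charset_from_content_type(content: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type value, or `default` if absent."""
    for part in content.split(";"):
        # Each parameter looks like " charset=windows-1251"; strip the surrounding spaces.
        name, sep, value = part.partition("=")
        if sep and name.strip().lower() == "charset":
            return value.strip().strip("\"'")
    return default


# Assumed sample values, for illustration only
declared = charset_from_content_type("text/html; charset=windows-1251")
fallback = charset_from_content_type("text/html", default="windows-1251")
```

Note that the parameter name has to be stripped and lowercased before the comparison, since the value after the semicolon usually starts with a space.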

@GreyCat
Member

GreyCat commented Nov 16, 2017

Thanks, that's another really interesting gem that I had no idea about! Looks like the project is kind of abandoned, though :(

> for example SQLite.

Hmm, actually I recall doing a pretty convincing implementation of SQLite... Did I forget to commit it?..

> In an e-mail conversation, Alexei Hmelnov stated that one may assume that all the descriptions are under the MIT license and that he is going to add a license to the site when he has time.

Yeah, I've got your e-mail, thanks! Confirming his intention to make it available under an MIT-like license as well.

AFAIU, Alexei used Coco/R for grammar description, which technically can generate parsers for Python or Ruby, so probably it won't be that hard to create a script that would do the conversion.

@GreyCat
Member

GreyCat commented Nov 16, 2017

set byteorder rev
set byteorder norm

/me facepalms

@KOLANICH
Author

@GreyCat, I have already found that it is a CoCo/R grammar, but there are some issues with the Python generator and/or the grammar file itself.

The generator says that the grammar file is invalid. In fact, it looks like it really is invalid for vanilla CoCo/R: for example, it uses ) ( instead of } { and identifier* as in regexps instead of CoCo/R's {identifier}.

I have replaced these constructs with valid CoCo/R, but it still fails in another place. I'm not very familiar with CoCo/R, but another grammar with the same tokens works fine. Maybe there is still something wrong in this grammar, but I won't be sure until I have finished fixing the generator so that it can run the test suite.
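One of the rewrites mentioned above, turning the regexp-style identifier* into CoCo/R's {identifier}, could be roughly automated. This is only a sketch under stated assumptions: the function name and the sample production are made up for illustration, and the regular expression does not know about string literals or comments inside a grammar, so its output would still need manual review.

```python
import re

# An identifier immediately followed by "*", e.g. "Field*"
STAR_REPETITION = re.compile(r"\b([A-Za-z_]\w*)\*")


def star_to_braces(grammar_text: str) -> str:
    """Rewrite `name*` repetitions as the `{name}` form that CoCo/R expects."""
    return STAR_REPETITION.sub(r"{\1}", grammar_text)


# Hypothetical production, not taken from the FlexT grammar
converted = star_to_braces("Decl = Ident Field* .")
```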

@GreyCat
Member

GreyCat commented Nov 16, 2017

Well, maybe manual conversion would be more feasible in this case. Probably we should make a list of the formats available there that are not yet available for KS? Quite a few of these formats are already available in our repo.

@KOLANICH
Author

KOLANICH commented Nov 16, 2017

The script in the first post does create the list. There are too many formats we don't have, so I guess we need a tool.

But a manual conversion of the grammar to some actively maintained parser generator like ply or parglare would be totally feasible if I can't make pyCoCo/R work.
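For the list of formats that are not yet available for KS, the nested index the script stores in cat.json (class, then status, then file) can be flattened and compared against the names already present in the formats repo. A minimal sketch, assuming that matching by lowercase file stem is good enough; the helper names, the sample entries and the .rfi extension are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, Iterator, List, Tuple

# Same nesting as the index built by the script: class -> status -> file -> row
Index = Dict[str, Dict[str, Dict[str, dict]]]


def iter_formats(index: Index) -> Iterator[Tuple[str, str, str]]:
    """Yield (class, status, file) for every format in the nested index."""
    for cls_name, statuses in index.items():
        for status_name, formats in statuses.items():
            for file_name in formats:
                yield cls_name, status_name, file_name


def missing_formats(index: Index, known: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return the entries whose file stem matches none of the known format names."""
    known_lower = {name.lower() for name in known}
    return [
        entry for entry in iter_formats(index)
        if PurePosixPath(entry[2]).stem.lower() not in known_lower
    ]


# Hypothetical sample data; a real run would load cat.json with json.load instead
sample_index = {
    "DB": {"ready": {"sqlite.rfi": {}, "dbf.rfi": {}}},
    "Images": {"draft": {"png.rfi": {}}},
}
missing = missing_formats(sample_index, ["png", "dbf"])
```

Name matching is crude: a format may be named differently on the two sides, so the result is a starting point for a manual check rather than a definitive list.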

@KOLANICH
Author

KOLANICH commented Nov 23, 2017

Failed to make CoCoPy compile that grammar. (CoCo/R is LL(1), but the grammar seems to use some extensions beyond LL(1) and some additional ones; I haven't found any description or implementation of CoCo/R that uses these extensions.) Converted the grammar to a parglare one. (parglare claims to be the fastest parser generator for Python, but we know that CoCoPy is the fastest because it generates code and because it is LL(1); parglare generates LR and GLR parsers, and GLR is more powerful.) The grammar compiles but doesn't work even in GLR mode (it says that "\n" is expected) :(. It is damn slow even in LR mode. Needs further debugging.

@NotWearingPants

The original link is broken (when you click a file type); here's a new one:
http://geos0.icc.ru/scripts/WWWBinV.dll/Cat

@KOLANICH
Author

KOLANICH commented Jun 3, 2023

Thanks, @NotWearingPants, for letting us know. Thanks, @generalmimon, for fixing.

@NotWearingPants, I have fixed some bugs in the download script too. If you want to work on the FlexT converter, https://github.com/KOLANICH-specs/kaitai_struct_formats/tree/FlexT/flext may be helpful. It is a temporary branch and should not be PRed as is; it is just a scratch space for my convenience.
