Pdoxcl2Sharp is a general parser for files related to Paradox Interactive. While the parser is aimed towards Paradox Interactive, it is not exclusive, meaning that any file or configuration written in a similar style can be parsed without problems.
- Speed: Seriously. The parser was written to rip through 50MB files as fast as possible
- Reliability: There are a 100+ tests to ensure that the parser can handle any situation
- Encoding: The parser handles the encoding and decoding of files so that all your letters render fine. I'm looking at some of you guys (šžŸ)
- Ease of use: Don't worry about if something contains quotes, if a list spans multiple lines, or about brackets. The parser takes care of everything. The parser only shows you what you care about.
- Saving: You can as easily write info as parse it.
- Lossless Compression: If you don't care about a pretty output, you can compress what is written and it will still be read successfully from the parser and Paradox, hence the phrase "lossless". You can achieve compression ratios up to three (so your new file will be three times smaller than the old).
- No dependencies: Written in pure managed C#, relying on no other libraries, Pdoxcl2Sharp has seamless integration into any situation
Many of those who play Paradox Interactive's games are also programmers; however, one of the biggest hurdles to developing a tool is the actual parsing of the files. Parsing is a nontrivial problem that is hard to get right. This project aims to be 100% compatible with the parser Paradox uses. The end is to eliminate of all the boiler plate code and provide a "plug and parse" mechanism.
Pdoxcl2Sharp can be installed from NuGet:
PM> Install-Package Pdoxcl2Sharp
If you will be doing all the parsing manually, then all you need to do is grab the package from the latest release, and add the appropriate reference in your favorite IDE.
If you want to use the parser generator, ParseTemplate.tt, which I highly recommend, then you'll have to grab that file as well, and then modify the text in between the commented "add" section. Add a reference to the downloaded .dll in the generator.
Say you want to parse this file:
# Hey, I'm a comment, put me anywhere and everything
# until the end of line won't matter and will be chucked!
theoretical = {
infantry_theory
militia_theory
mobile_theory
}
Here's how:
// Here we define a class that says "I can read and write myself"
public class TheoreticalFile : IParadoxRead, IParadoxWrite
{
IList<string> theories = new List<string>();
// This is the read interface. Whenever the parser finds something
// interesting it will pass it to this function as the token. From
// there, we decide if we want to process it further. This function
// will never receive the whitespace, null, a bracket, etc. With the
// example, token will be "theoretical", "infantry_theory",
// "militia_theory", and "mobile_theory"
public void TokenCallback(ParadoxParser parser, string token)
{
// Hey, we know what to do when we see "theoretical"!,
// we know that it is simply a list of strings.
if (token == "theoretical")
{
theories = parser.ReadStringList();
}
}
// This is the write interface, and unlike the read interface
// this is not a callback. This simply means that given a writer
// the class will dump its contents there.
public void Write(ParadoxStreamWriter writer)
{
writer.WriteComment("Hey, I'm a new comment");
// Write the header for the list. Notice that we include a bracket.
// The writer will automatically indent subsequent lines by an
// additional tab
writer.WriteLine("theoretical = {");
foreach (var theory in theoryFile.theories)
{
writer.WriteLine(theory, ValueWrite.LeadingTabs);
}
writer.WriteLine("}");
}
}
public static int Main()
{
TheoreticalFile theoryFile;
using (FileStream fs = new FileStream("theories.txt", FileMode.Open))
{
theoryFile = ParadoxParser.Parse(fs, new TheoreticalFile());
}
// Save the information into a new file
using (FileStream fs = new FileStream("theories.new.txt", FileMode.Create))
using (ParadoxSaver saver = new ParadoxSaver(fs))
{
theoryFile.Write(saver);
}
}
So the previous example was fine and all to write by hand, but as one can imagine the more complex a document that needs to be parsed is, the more complex the code to write the parser is. Luckily, we can take advantage of T4 templates in the ParseTemplate.tt file to write the reading and writing code. You'll gain a lot from using the parser template. For every one line of code specified in the T4 document, expect at least six lines of code written for you. The greatest thing is understanding how to modify the T4 template is insanely simple. If you're trying to integrate the template into your project for the first time, here are a couple of instructions that you need to execute one.
- Locate the line
<#@ assembly name="$(SolutionDir)\Pdoxcl2Sharp\bin\Pdoxcl2Sharp.dll" #>
. This tells the template where to find some code needed to run the template. Right now, the value points to the build directory for this project, which will obviously differ for your project. You should take the Pdoxcl2Sharp.dll and move it a vendor directory in your project. Change the assembly line in the template to the new location, which should now be<#@ assembly name="$(SolutionDir)\vendor\Pdoxcl2Sharp.dll" #>
- Locate
namespace Pdoxcl2Sharp.Test
and change it to the desired namespace
Now we get to the good stuff. The T4 template works off an array whose elements
contain a Name, which will be the generated class's name, and an array of
properties. Each property has a required Name and Type field. The type of the
property is the literal string representation of the type eg. "string", "int",
etc and from this type, the template knows what to generate for reading and
writing. The template tries to guess what the property looks like in the text
by transforming the Name field. For instance, a property Name of
"GoodLookingMan" will cause the template to look for "good_looking_man" in the
text. The template is incredibly smart about this. Given an "IList" with
a property name of "Armies", the template will generate code for reading and
writing individual "army" elements. Naming can be overridden by explicitly
specifying the Alias
field.
The last thing that you need to tell the template is if the string you are
writing should contain quotes surrounding it. Do this by specifying the Quoted = true
field.
Here is a more advanced example using what we just talked about. Let's say you want to parse the following file (it is slightly unrealistic, but it serves as a good example)
name = "My Prov"
tax = 1.000
add_core=MEE
add_core=YOU
add_core=THM
top_provinces={ "BNG" "ORI" "PEG" }
army={
name = "My first army"
leader = { id = 5 }
unit = {
name = "First infantry of Awesomeness"
type = ninjas
morale = 5.445
strength = 0.998
}
unit = {
name = "Second infantry of awesomeness"
type = ninjas
morale = 6.000
strength = 1.000
}
}
You could manually code this up in an hour, or you could modify ParseTemplate.tt and in a few minutes and 35 lines of code, you could have the same thing! Here are the relevant lines added to ParseTemplate.tt:
var classes = new[] {
new {
Name = "Province",
Props = new[] {
new Property() { Type = "string", Name = "Name", Quoted = true },
new Property() { Type = "double", Name = "Tax" },
new Property() { Type = "IList<string>", Name = "Cores", Alias = "add_core" },
new Property() { Type = "[ConsecutiveElements] IList<string>",
Name = "TopProvinces", Quoted = true},
new Property() { Type = "IList<Army>", Name = "Armies"}
}
},
new {
Name = "Army",
Props = new[] {
new Property() { Type = "string", Name = "Name", Quoted = true },
new Property() { Type = "Leader", Name = "Leader" },
new Property() { Type = "IList<Unit>", Name = "Units" }
}
},
new {
Name = "Unit",
Props = new[] {
new Property() { Type = "string", Name = "Name", Quoted = true },
new Property() { Type = "string", Name = "Type" },
new Property() { Type = "double", Name = "Morale" },
new Property() { Type = "double", Name = "Strength" }
}
},
new {
Name = "Leader",
Props = new[] {
new Property() { Type = "int", Name = "Id" }
}
}
};
Here's the generated auto-generated content:
using System;
using Pdoxcl2Sharp;
using System.Collections.Generic;
namespace Pdoxcl2Sharp.Test
{
public partial class Province : IParadoxRead, IParadoxWrite
{
public string Name { get; set; }
public double Tax { get; set; }
public IList<string> Cores { get; set; }
public IList<string> TopProvinces { get; set; }
public IList<Army> Armies { get; set; }
public Province()
{
Cores = new List<string>();
Armies = new List<Army>();
}
public void TokenCallback(ParadoxParser parser, string token)
{
switch (token)
{
case "name": Name = parser.ReadString(); break;
case "tax": Tax = parser.ReadDouble(); break;
case "add_core": Cores.Add(parser.ReadString()); break;
case "top_provinces": TopProvinces = parser.ReadStringList(); break;
case "army": Armies.Add(parser.Parse(new Army())); break;
}
}
public void Write(ParadoxStreamWriter writer)
{
if (Name != null)
{
writer.WriteLine("name", Name, ValueWrite.Quoted);
}
writer.WriteLine("tax", Tax);
foreach(var val in Cores)
{
writer.WriteLine("add_core", val);
}
if (TopProvinces != null)
{
writer.Write("top_provinces={ ");
foreach (var val in TopProvinces)
{
writer.Write(val, ValueWrite.Quoted);
writer.Write(" ");
}
writer.WriteLine("}");
}
foreach(var val in Armies)
{
writer.Write("army", val);
}
}
}
public partial class Army : IParadoxRead, IParadoxWrite
{
public string Name { get; set; }
public Leader Leader { get; set; }
public IList<Unit> Units { get; set; }
public Army()
{
Units = new List<Unit>();
}
public void TokenCallback(ParadoxParser parser, string token)
{
switch (token)
{
case "name": Name = parser.ReadString(); break;
case "leader": Leader = parser.Parse(new Leader()); break;
case "unit": Units.Add(parser.Parse(new Unit())); break;
}
}
public void Write(ParadoxStreamWriter writer)
{
if (Name != null)
{
writer.WriteLine("name", Name, ValueWrite.Quoted);
}
if (Leader != null)
{
writer.Write("leader", Leader);
}
foreach(var val in Units)
{
writer.Write("unit", val);
}
}
}
public partial class Unit : IParadoxRead, IParadoxWrite
{
public string Name { get; set; }
public string Type { get; set; }
public double Morale { get; set; }
public double Strength { get; set; }
public Unit()
{
}
public void TokenCallback(ParadoxParser parser, string token)
{
switch (token)
{
case "name": Name = parser.ReadString(); break;
case "type": Type = parser.ReadString(); break;
case "morale": Morale = parser.ReadDouble(); break;
case "strength": Strength = parser.ReadDouble(); break;
}
}
public void Write(ParadoxStreamWriter writer)
{
if (Name != null)
{
writer.WriteLine("name", Name, ValueWrite.Quoted);
}
if (Type != null)
{
writer.WriteLine("type", Type);
}
writer.WriteLine("morale", Morale);
writer.WriteLine("strength", Strength);
}
}
public partial class Leader : IParadoxRead, IParadoxWrite
{
public int Id { get; set; }
public Leader()
{
}
public void TokenCallback(ParadoxParser parser, string token)
{
switch (token)
{
case "id": Id = parser.ReadInt32(); break;
}
}
public void Write(ParadoxStreamWriter writer)
{
writer.WriteLine("id", Id);
}
}
}
If you do all the parsing by hand then there is no need to worry about this; however, if you are using the generator then there are a couple of ambiguous situations. Let's say you have a list of strings representing factories:
factories={Here There Everywhere}
This is also valid
factory = Here
factory = There
factory = Everywhere
It would prohibitively expensive to support both variations with a single list, hence the attribute must be appended lists that are in the format of the first factory example.
Consecutive Elements for non-primitive types is even more interesting as there
three good representations for the list, and all three are used in practice.
Let's say there is a list of Attachments
that we want parsed, the three ways
to encounter this list are as follows.
// First method - individual occurences (non-consecutive elemtents)
attachment={...}
attachment={...}
// Second method - nested structures
attachments={ {...} {...} }
// Third method - nested structures with headers
attachments={ attachment={...} attachment={...}}
The second and third methods have consecutive elements, and in a tough decision it was decided that the parser would support the second method in automatic parsing, whereas the third method involves the client fleshing out the parsing structure.
To parse the second method, use the following line:
list = parser.ReadList(() => parser.Parse(new Attachment()));
Popular libraries for XML, JSON, YAML, CSV, and basically any structured text
document has an automatic deserialization feature, where all one has to do is
call Deserialize<MyType>(text)
and they will receive a populated object back.
There currently is a partially implemented version, which can be invoked with
ParadoxParser.Deserialize<MyType>()
. It does everything fine except
consecutive list elements. Please feel free to implement this one missing
feature. I doubt I'll work on it as I don't see the benefits right now, as will
be explained.
It's hard to support automatic deserialization, as Paradox files are not
rigorously structured. For instance, in JSON there is only one-way to denote a
list a: [...]
, but in a paradox file there are multiple
ways to handle lists. Sure, there is
a way around this if we force the user to denote everything with custom
attributes. The only problem would be the number of attributes might get out of
control. An aliased list with consecutive elements of quoted strings would need
three attributes to deserialize and serialize correctly. To me this seems
pretty cluttered. Another technical reason that makes it hard is that objects
can be in a partially deserialized state such as a list. We know that once
a:[...]
is read that a
is fully constructed, but if ou construct the list
out of individual a
elements, you have a situation with a partially
constructed list. This isn't a problem by itself; it just exasperates current
problems with parsing lists.
Another reason why I'm not a fan is that deserializeing something is inherently slow if done dynamically and especially something as complex as Paradox files. Some of the use cases involve parsing a 50MB save file with a deep and complex hierarchical tree, and parsing 10,000 similarly structured small files. Any time spent discovering how to deserialize an object will be too long. An interesting notion is precompiling like protobuf-net, as there really isn't a need to rediscover how to deserialize an object on each program invocation. I've toyed with implementing this, but I've decided against this, as I'm unsure of the value that would be added.
Some popular libraries like ServiceStack.Text use a concurrent dictionary to store type and the function that will parse out the type after this function has been computed. This is a great time saver but it is still too much time. Think about it this way. Say we have 1.5 million lines with 100,000 objects of maybe three dozen different kinds. Even if the dictionary was preloaded, a 100,000 lookups on a concurrent dictionary is fast, but definitely non-negligible. For those curious, in this instance a single threaded regular dictionary would be faster, but there would be no caching bonus across parallel deserialiations (think back to the 10,000 file example).
These methods are the heart of the parsing. Whenever the parser comes across something interesting, it invokes the callback with what it found. The token found will never be a comment, an equal sign, brackets, empty, null or a string that has quotes (the quotes are stripped). ReadString has the exact same behavior
If there are child structures that are being parsed (delimited by squirrely
brackets {}
) and the parent and the children share the same tokens such as
"name" and "id", these can be differentiated by querying CurrentIndent
. The
children will be at a higher indent than the parents. Perhaps a better
solution to this approach is to define a separate class for the children or an
Action<ParadoxParser, string>
and invoke Parse
on the parser.
So you want to help? Great! Here are a series of steps to get you on your way! Let me first say that if you have any troubles, file an issue.
- Get a github account
- Fork the repo
- Add a failing test. The purpose of this is to show that what you are adding couldn't have been done before, or was wrong.
- Add your changes
- Commit your changes in such a way there is only a single commit difference between the main branch here and yours. If you need help, check out git workflow
- Push changes to your repo
- Submit a pull request and I'll review it!
Install dotnet SDK 2.0 and then
dotnet build
If you have Visual Studio 2017, I believe you can just hit "Build"
This section describes in more detail the style of files that can be parsed.
The most important characters to the parser are =
, "
, }
, {
, #
, ,
, and
whitespace. Let's call this set lexemes. The complement of lexemes is the
untyped. Lexemes delimit tokens, which are composed of untyped characters.
The parsing process goes as such:
- Read a character and test to see if it is a lexeme.
- If a command peeked at the next lexeme, return that
- else advance through whitespace until a lexeme
- If that lexeme is a comment, advance until newline and go back to step 1
- else if
{
increase indent by one - else if
}
decrease indent by one - return encountered lexeme
- If lexeme is a quote, advance until closing quote
- Everything enclosed in the quotes excluding the quotes is a token
- Lexemes are considered untyped wrapped in quotes
- else if any other lexeme go back to step 1
- else if the character read was untyped
- All characters until whitespace or a lexeme is considered a token
- Return token
This is probably, at best, hard to follow. Hopefully, a couple of examples will clear some of the confusion.
player = "AAA"
player= = "AAA"
Both are parsed (ie, the tokens returned to the client) the same way player
and then AAA
. An explanation of this is that all whitespace is skipped until
something interesting is encountered. Then player
is grabbed. From there,
the next token starts at the quotes. Remember that tokens are composed
strictly of untyped characters, hence the second equals is completely ignored.
This leads to an interesting example showcasing the parser's flexibility.
ids = { 1 2 3 4 }
ids={1 2 3 4 }
ids = {
1, 2,
3, 4,
}
All three snippets will return ids
, 1
, 2
, 3
, and 4
as tokens. The
extra comma at the end of the '4' is even on purpose.
Pdoxcl2Sharp is licensed under MIT, so feel free to do whatever you want, as long as this license follows the code.