Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For column projection, only a partial indexing is necessary #2

Open
lemire opened this issue May 2, 2019 · 2 comments
Open

For column projection, only a partial indexing is necessary #2

lemire opened this issue May 2, 2019 · 2 comments

Comments

@lemire
Copy link
Collaborator

lemire commented May 2, 2019

Terminology: Selecting a subset of columns is called a projection.

Since extracting the indexes is expensive, you may want to only pick some of the indexes, never committing to memory the unneeded indexes. So maybe you want the 4th index and the 10th one. Then you want the 14th index and the 20th index and so forth. Skipping an index is cheap, it is a single instruction (blsr).

So, conceptually, getting just the indexes you need is easy. Practically, there is a software engineering issue in getting the whole thing to work... but it can be done.

@geofflangdale
Copy link
Owner

Yes, assuming we have 10 indexes. We actually need the 5th and 1st indexes as well to serve as end delimiters.

Skipping indexes could be done en masse by masking out the unwanted indexes. This is simple until you have >64 columns, and not completely awful even then, although it probably doesn't achieve in practice.

One would make up a mask of repeated bits - going from LSB to MSB this would be 10011000001 (if I counted right) repeated in a register and double-shifted and OR'ed together appropriately.

This sounds a bit stupidly exotic, but a similar technique could be used to detect erroneous numbers of fields (too many / too few) by applying it to a bit-selected sequence of separator vs terminator bits. So for 5 fields per line you would be wanting to make sure that we see 0000100001 etc (with the same tricky double-shift and the same problems with >64 columns) as the bit-pattern of terminators within the union of "terminators and separators".

All of this might be overkill compared to simpler non-bit-parallel techniques, but you know me.

@lemire
Copy link
Collaborator Author

lemire commented May 3, 2019

Just so we are clear, I am thinking about working at the bit level. So this is prior to the flatten call. Yes, it is exotic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants