[Request] Ability to use OR statements within fread #2185

Closed
bg49ag opened this issue Jun 3, 2017 · 2 comments

@bg49ag

bg49ag commented Jun 3, 2017

It would be handy if there was some way to use OR statements within fread.

Some of the data I work with has variable column positions and variable field names, but only one form of a given field name is ever present at any one time.

As far as I'm aware, this isn't available with stock data.frames, but perhaps there's some way to add the functionality to data.table. I think it would require a consistent name to be assigned first within the function; e.g.

select = c('a | a1 | a2', 'b | b1 | b2'),
col.names = c('a | a1 | a2' = 'a', 'b | b1 | b2' = 'b'),
colClasses = c(a = 'class', b = 'class'),

To avoid collisions, it would only need to warn if it encounters more than one of the OR alternatives for the same field within a file.
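Until something like this exists in fread itself, the same effect can be approximated with two passes over each file. The following is a minimal sketch, not a data.table feature: `fread_alias()` and its `aliases` argument are hypothetical names made up for illustration, and the field names and values in the demo files are invented. It reads only the header with `nrows = 0`, picks whichever variant of each field is present, reads just those columns, and renames them to the consistent name.

```r
library(data.table)

# Hypothetical helper (not part of data.table).
# `aliases` is a named list: each name is the consistent output name,
# each element holds the header variants accepted for that field.
fread_alias <- function(file, aliases, ...) {
  # Pass 1: read the header only, no data rows.
  header <- names(fread(file, nrows = 0))

  # For every field, keep the variants that actually occur in this file.
  found <- lapply(aliases, intersect, header)
  n_found <- lengths(found)

  if (any(n_found == 0L)) {
    stop("No variant found for: ",
         paste(names(aliases)[n_found == 0L], collapse = ", "))
  }
  if (any(n_found > 1L)) {
    warning("More than one variant present for: ",
            paste(names(aliases)[n_found > 1L], collapse = ", "),
            "; using the first one listed")
  }
  picked <- vapply(found, function(v) v[1L], character(1))

  # Pass 2: read only the matched columns, then rename them by name,
  # so the result does not depend on column order in the file.
  dt <- fread(file, select = unname(picked), ...)
  setnames(dt, old = unname(picked), new = names(picked))
  dt
}

# Two throwaway files with different spellings and column positions.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("REPT_DT,GNDR_COD,x", "2017-01-01,male,1"), f1)
writeLines(c("x,sex,rept_dt", "2,female,2017-01-02"), f2)

aliases <- list(rept_dt = c("REPT_DT", "rept_dt"),
                gender  = c("GNDR_COD", "gndr_cod", "sex"))

a <- fread_alias(f1, aliases)
b <- fread_alias(f2, aliases)
```

Both `a` and `b` end up with columns named `rept_dt` and `gender`, and the source files are never modified. Reading the header separately costs an extra file open per file, which is usually small next to the read itself.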

@MichaelChirico
Member

Possible duplicate/extension of #2066

@bg49ag
Author

bg49ag commented Jun 5, 2017

Well spotted! #2066 is essentially the same idea.

I ended up using a similar approach to work around the problem ('manually' re-writing the headers). This works, but it's time-consuming (kind of the opposite of data.table itself) and it may not always be permissible; e.g. for legal records, re-writing the field names could be construed as tampering with an original data set.

Another approach is to explicitly state each header and load everything outside of a loop. Not too bad if it's only one or two, or ten. Awful when it's hundreds, thousands, hundreds of thousands.

Being able to state two or more field names per field to import would be a very quick solution.

With regard to this, I would also suggest adding a built-in field name survey argument to fread. E.g. if I have a directory with ten thousand files in it, I'd like the ability to simply point to the directory, have it pull all the headers and drop them into a data.table with the file name next to each. An additional function could be to show only those that differ in some way. This can already be done with fread, by looping through the files, reading the first few rows of each with nrows and binding the results, but it'd be even easier if there was some form of survey argument within fread that automated the process.
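The looping approach mentioned above might look roughly like this. It is a sketch under assumptions: `survey_headers()` and `header_layouts()` are hypothetical helper names, not fread arguments, and the files are assumed to be CSVs with a header row. Each header is read with `nrows = 0` so no data rows are loaded.

```r
library(data.table)

# Hypothetical helper: one row per (file, column position, field name).
survey_headers <- function(dir, pattern = "\\.csv$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  rbindlist(lapply(files, function(f) {
    fields <- names(fread(f, nrows = 0))  # header only
    data.table(file = basename(f),
               position = seq_along(fields),
               field = fields)
  }))
}

# Hypothetical helper: collapse each file's header into one string and
# count how many files share each distinct layout.
header_layouts <- function(hdr) {
  per_file <- hdr[, .(layout = paste(field, collapse = " | ")), by = file]
  per_file[, .(n_files = .N, first_file = file[1L]), by = layout]
}

# Demo on a throwaway directory with invented files.
d <- tempfile("headers_")
dir.create(d)
writeLines(c("REPT_DT,GNDR_COD,n", "2017-01-01,male,1"),   file.path(d, "a.csv"))
writeLines(c("REPT_DT,GNDR_COD,n", "2017-01-02,female,2"), file.path(d, "b.csv"))
writeLines(c("n,sex,rept_dt", "3,male,2017-01-03"),        file.path(d, "c.csv"))

hdr <- survey_headers(d)
layouts <- header_layouts(hdr)
```

Here `hdr` is the long file-by-field table, and `layouts` has one row per distinct header, so a directory where nothing changed collapses to a single row.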

I've done the above by writing some code myself, then exporting the results (for a smaller directory) to Excel to produce a pictogram showing the resulting structure (example pictogram). Each 'row' is the next file, file names run down the left, and each colour is a different field name, with the field names in the key on the right. This allows for a visual examination of what is going on within a directory; for example, the fact that many of the coloured vertical lines don't line up indicates that structural changes have occurred (as their shading is specific to a field name). Maybe there's some way to tie the output of a fread field name sweep directly into an R plot to produce something similar.

Edit: The field names also change in the pictogram example; e.g. 'REPT_DT' goes to 'rept_dt' and then ' rept_dt' (with a space before it), and 'GNDR_COD' goes to 'gndr_cod' and then 'sex'.
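A pictogram of this kind can probably be drawn straight from R with base graphics. The sketch below is only one possible way to do it: `header_matrix()` and `plot_headers()` are hypothetical helpers, and the small input table (file names and layout) is made up, standing in for the output of a header sweep with one row per file, column position and field name.

```r
# Hypothetical helper: files in rows, column positions in columns,
# each cell an integer code identifying the field name found there.
header_matrix <- function(hdr) {
  fields <- unique(hdr$field)  # codes follow order of first appearance
  files <- unique(hdr$file)
  m <- matrix(NA_integer_, nrow = length(files), ncol = max(hdr$position),
              dimnames = list(files, NULL))
  m[cbind(match(hdr$file, files), hdr$position)] <- match(hdr$field, fields)
  attr(m, "fields") <- fields
  m
}

# Hypothetical helper: one colour per field name, first file at the top.
plot_headers <- function(m) {
  fields <- attr(m, "fields")
  cols <- rainbow(length(fields))
  image(x = seq_len(ncol(m)), y = seq_len(nrow(m)),
        z = t(m)[, nrow(m):1, drop = FALSE],
        col = cols, breaks = seq(0.5, length(fields) + 0.5, by = 1),
        axes = FALSE, xlab = "column position", ylab = "")
  axis(1, at = seq_len(ncol(m)))
  axis(2, at = seq_len(nrow(m)), labels = rev(rownames(m)), las = 1)
  legend("topright", legend = fields, fill = cols, bg = "white")
}

# Invented sweep result: the names change case, then one is renamed
# and the two columns swap position.
hdr <- data.frame(
  file = rep(c("2015.csv", "2016.csv", "2017.csv"), each = 2),
  position = rep(1:2, times = 3),
  field = c("REPT_DT", "GNDR_COD", "rept_dt", "gndr_cod", "sex", "rept_dt"),
  stringsAsFactors = FALSE
)

m <- header_matrix(hdr)
if (interactive()) plot_headers(m)
```

Because every distinct spelling gets its own colour, a case change or a stray leading space shows up as a colour change, and a moved column shows up as the same colour in a different position.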
