[Request] Ability to use OR statements within fread #2185

bg49ag · 2017-06-03T21:15:07Z

It would be handy if there was some way to use OR statements within fread.

Some of the information I work with has both variable column positions and variable field names, but only one form of a field name is ever present at any one time.

As far as I'm aware, this isn't available with stock data.frames, but perhaps there's some way to add the functionality to data.table. I think it would require a consistent name to be assigned first within the function; e.g.

select = c('a | a1 | a2', 'b | b1 | b2'),
col.names = c('a | a1 | a2' = 'a', 'b | b1 | b2' = 'b'),
colClasses = c(a = 'class', b = 'class'),

To avoid collisions it would only need to warn if it encounters more than one OR variable within a file.
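Until something like this exists in fread itself, the same effect can be approximated in user code. The following is only a minimal sketch, not a data.table feature: `fread_any` and the `aliases` list are hypothetical names introduced here for illustration. It reads just the header with `nrows = 0`, picks whichever alias is present for each canonical name, warns if more than one alias of a group appears, and renames by name so that column position does not matter.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read one column per alias
# group, whichever alias the file happens to use, under a canonical name.
fread_any <- function(file, aliases, ...) {
  # Read only the header line to discover the column names actually present.
  present <- names(fread(file, nrows = 0))
  # For each canonical name, find which of its aliases occur in this file.
  found <- lapply(aliases, intersect, y = present)
  n_found <- lengths(found)
  if (any(n_found == 0L))
    stop("no alias found for: ",
         paste(names(aliases)[n_found == 0L], collapse = ", "))
  if (any(n_found > 1L))
    warning("more than one alias found for: ",
            paste(names(aliases)[n_found > 1L], collapse = ", "),
            "; using the first")
  chosen <- vapply(found, `[`, character(1), 1L)
  dt <- fread(file, select = unname(chosen), ...)
  # Rename by name, so the result does not depend on column order in the file.
  setnames(dt, old = unname(chosen), new = names(chosen))
  dt[]
}

# Usage with a throwaway file whose header uses 'b1' and 'a2'.
aliases <- list(a = c("a", "a1", "a2"), b = c("b", "b1", "b2"))
f <- tempfile(fileext = ".csv")
writeLines(c("b1,x,a2", "1,2,3"), f)
res <- fread_any(f, aliases)
```

The file on disk is never modified, which sidesteps the concern about rewriting original headers; the cost is that each file's header is read twice.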

MichaelChirico · 2017-06-03T21:45:11Z

Possible duplicate/extension of #2066

bg49ag · 2017-06-05T17:48:34Z

Well spotted! #2066 is essentially the same idea.

I ended up using a similar approach to work around the problem ('manually' rewriting the headers). This works, but it's time-consuming (kind of the opposite of data.table itself) and it may not always be applicable; e.g. for legal records, rewriting the field names could be construed as tampering with an original data set.

Another approach is to explicitly state each header and load everything outside of a loop. Not too bad if it's only one or two, or ten. Awful when it's hundreds, thousands, hundreds of thousands.

Being able to state two or more field names per field to import would be a very quick solution.

With regard to this, I would also suggest possibly adding a built-in field name surveying argument to fread. E.g. if I have a directory with ten thousand files in it, I'd like the ability to simply point to the directory, have it pull all the headers and drop them into a data.table with the file name next to each. An additional function could be to show only those that differ in some way. This can already be done with fread by looping through the files, reading only the first nrows of each and binding the results, but it'd be even easier if there was some form of survey argument within fread that'd automate the process.
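For reference, the looping approach described above might look roughly like this. It is a sketch under assumptions: `survey_headers` is a hypothetical helper name, the files are assumed to be delimited text with a header row, and the `.csv` pattern is only a default. It reads nothing but the header of each file (`nrows = 0`) and stacks the results into one long table, from which the distinct layouts can be counted.

```r
library(data.table)

# Hypothetical helper: collect the header of every matching file in a
# directory, one row per (file, column position).
survey_headers <- function(dir, pattern = "\\.csv$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  rbindlist(lapply(files, function(f) {
    # header = TRUE is an assumption: every file starts with a header row.
    nm <- names(fread(f, nrows = 0, header = TRUE))
    data.table(file = basename(f), position = seq_along(nm), field = nm)
  }))
}

# Usage on a small throwaway directory with two differing headers.
d <- tempfile("survey_")
dir.create(d)
writeLines(c("REPT_DT,GNDR_COD", "2017-01-01,1"), file.path(d, "f1.csv"))
writeLines(c("rept_dt,sex", "2017-01-02,2"), file.path(d, "f2.csv"))
hdr <- survey_headers(d)

# Collapse each file's header into one string, then count how many files
# share each layout, to show only the layouts that differ.
layouts <- hdr[, .(header = paste(field, collapse = "|")), by = file]
layouts[, n_files := .N, by = header]
```

The long format (`file`, `position`, `field`) is chosen because it can be passed more or less directly to a plotting function to draw the kind of per-file picture described in the next paragraph.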

I've done the above by writing some code myself, then exported the results (for a smaller directory) to Excel to produce a pictogram showing the resulting structure; example pictogram, where each 'row' is the next file, file names are down the left, and each colour is a different field name, with the field names in the key on the right. This allows for a visual examination of what is going on within a directory; for example, the fact that many of the coloured vertical lines don't line up indicates structural changes have occurred (as their shading is specific to a field name). Maybe there's some way to tie the output of a fread field name sweep directly into an R plot to produce something similar.

Edit: The field names also change in the pictogram example; e.g. 'REPT_DT' goes to 'rept_dt' and then ' rept_dt' (with a space before it), and 'GNDR_COD' goes to 'gndr_cod' and then 'sex'.
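Variants like these (case changes, stray whitespace, outright renames) can be folded onto one canonical name after loading. This is a rough sketch only: `normalise_names` and the `canonical` lookup are hypothetical and would have to be built per data set. The choice of 'sex' as the canonical spelling is arbitrary.

```r
library(data.table)

# Hypothetical lookup from lower-cased, trimmed spellings to a canonical name.
canonical <- c(rept_dt = "rept_dt", gndr_cod = "sex", sex = "sex")

# Rename the columns of dt in place; names with no lookup entry are left as-is.
normalise_names <- function(dt, lookup = canonical) {
  key <- tolower(trimws(names(dt)))
  hit <- key %in% names(lookup)
  new <- names(dt)
  new[hit] <- lookup[key[hit]]
  setnames(dt, new)
  dt
}

# Usage: a table carrying two of the variants mentioned above.
dt <- data.table(x = "2017-01-01", y = "M", z = 1L)
setnames(dt, c(" rept_dt", "GNDR_COD", "other"))
normalise_names(dt)
```

If a file contained two variants that map to the same canonical name, this would produce duplicate column names, so a real version would want a check for that case.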

st-pasha added feature request fread labels Jun 28, 2017

st-pasha mentioned this issue Jul 6, 2017

Master task for fread bugs / proposals #2247

Closed

jangorecki added the duplicate label Jan 7, 2024

jangorecki closed this as completed Jan 7, 2024
