feat(io): complete read_csv #183
Conversation
Thanks @yingmanwumen, I may need more time to review this PR.
@yingmanwumen I think the way we are comparing floating numbers in Python is not good enough. The …
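For illustration (not from this PR): one common way to make the comparison tolerance-based is `math.isclose` or `pytest.approx`; the helper name and the `1e-6` tolerance below are assumptions, not ulist's actual approach.

```python
# Illustrative only: tolerance-based comparison for float32-parsed values.
# `assert_list_close` and the 1e-6 tolerance are assumptions for this sketch,
# not helpers that exist in ulist.
import math

import pytest


def assert_list_close(result, expected, rel_tol=1e-6):
    """Compare two lists of floats element-wise with a relative tolerance."""
    assert len(result) == len(expected)
    for got, want in zip(result, expected):
        if got is None or want is None:
            assert got is want
        else:
            assert math.isclose(got, want, rel_tol=rel_tol)


def test_float_column_close():
    # pytest.approx offers the same idea in one line for plain lists.
    assert [1.1000000238418579] == pytest.approx([1.1], rel=1e-6)
```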
tests/test_io.py (Outdated)

```python
        })
    ],
)
def test_constructors(
```
Perhaps another name such as `test_inputs`?
tests/test_io.py (Outdated)

```python
@pytest.mark.parametrize(
    "test_method, args, kwargs, expected_value",
    [
        (ul.read_csv, (), {
```
I am wondering if the error messages would be raised as expected if the schema dtype does not match the actual dtype. So shall we add some more tests like this?
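For illustration (not from this PR): a hypothetical dtype-mismatch test could look like the sketch below. It reuses the `path`/`schema` keyword style that appears later in this PR; the expected exception type is an assumption, since the thread doesn't say how `ulist` reports the mismatch.

```python
# Hypothetical test: the exact exception type is an assumption.
import pytest
import ulist as ul


def test_read_csv_dtype_mismatch():
    # "./test_csv/04_test_nan.csv" has an "int" column (see the test case added
    # later in this PR); declaring it as "bool" should not silently succeed.
    with pytest.raises(Exception):
        ul.read_csv(
            path="./test_csv/04_test_nan.csv",
            schema={"int": "bool"},
        )
```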
Other cases could be:
- The `schema.len()` is greater than the number of columns in the `.csv` file;
- The `schema.len()` is less than the number of columns in the `.csv` file;
- The column name does not match.
Hello,
the strategy I chose is:
- If a field in `schema` doesn't exist in the `.csv` file, then `ulist` will return an empty list for it, such as `{..., "bar": [], ...}`;
- If a field in the `.csv` file doesn't exist in `schema`, then `ulist` will ignore it.
So currently, the cases above cannot cause an exception.
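For illustration (not from this PR): the sketch below restates the strategy above against a hypothetical `tmp.csv` with columns `a` and `b`; the dict-shaped results in the comments mirror the expected values used in this PR's tests and are an assumption about the output shape.

```python
# Illustration of the described strategy; tmp.csv (columns "a", "b") is hypothetical.
import ulist as ul

# "bar" is in the schema but not in tmp.csv -> an empty list is returned for it.
result = ul.read_csv(path="tmp.csv", schema={"a": "int", "bar": "int"})
# expected shape (per the strategy above): {"a": [...], "bar": []}

# "b" is in tmp.csv but not in the schema -> it is simply ignored.
result = ul.read_csv(path="tmp.csv", schema={"a": "int"})
# expected shape: {"a": [...]}
```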
For 2., I agree that the result of `read_csv` could be a subset of the content of the `.csv` file, so could you also add another test case, where the schema has fewer columns than the csv file?
For 1., let's do some research on mainstream data analysis, data processing libs or even databases to see what kind of behavior would be better.
1. Pandas, the most popular `DataFrame` lib: the `read_csv` function will always read **all the columns** from the csv file, no matter how many col names are in the schema. See the [Doc](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). Suppose the `tmp.csv` file contains only columns `a` and `b`:

   ```python
   >>> import pandas as pd
   >>> pd.read_csv("tmp.csv")
      a  b
   0  1  2
   >>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int})
      a  b
   0  1  2
   >>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int, 'c': int})
      a  b
   0  1  2
   >>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int, 'c': int, 'd': int})
      a  b
   0  1  2
   >>> pd.read_csv("tmp.csv", dtype={'c': int, 'd': int})
      a  b
   0  1  2
   ```
2. SQL, the most used language for databases: it does a very strict check on column names and will raise an error if the name does not exist in the database. For example, if there is a table named `tmp` which contains columns `a` and `b`, the following script will just not work:

   ```sql
   SELECT CAST(tmp.c AS INT) as c from tmp
   ```
If you don't mind, would you help test the behavior of pyarrow, which is also one of the best data processing libs?
Thanks so much~
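For illustration (not from this PR): one way the pyarrow check could be run, assuming the same two-column `tmp.csv` as above. This only probes the behavior; whether pyarrow ignores, warns about, or rejects the unknown column `c` is exactly what running it would reveal.

```python
# Probe how pyarrow.csv handles a declared column that is absent from the file.
# tmp.csv (columns "a" and "b") is the same hypothetical file as above.
import pyarrow as pa
from pyarrow import csv

convert_options = csv.ConvertOptions(
    column_types={"a": pa.int64(), "b": pa.int64(), "c": pa.int64()}
)
table = csv.read_csv("tmp.csv", convert_options=convert_options)

# Inspect what actually came back: does the schema contain "c", and with what type?
print(table.schema)
print(table.column_names)
```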
And please feel free to investigate any other popular projects and share your ideas here. The reason why we have to be cautious is that in real industry usage, the number of columns could be 10, 20, 100 or even thousands, and it's very common for developers to accidentally type incorrect col names in the schema. So whether to simply ignore the typo, raise a warning, or raise an error is an important topic to discuss.
> For 2., I agree that the result of `read_csv` could be a subset of the content of the `.csv` file, so could you also add another test case, where the schema has fewer columns than the csv file?
Sure, the test case is appended in the newest commit b86fc43:
```python
...
(ul.read_csv, (), {  # schema.len() < field.len()
    "path": "./test_csv/04_test_nan.csv",
    "schema": {"int": "int",
               "bool": "bool"}
}, {
    "int": [None, 2, 3, 4],
    "bool": [True, False, True, None]
}),
...
```
> If you don't mind, would you help test the behavior of pyarrow, which is also one of the best data processing libs? Thanks so much~
^_^ I am glad to do this, but maybe a little later. I never thought about this before, honestly. I am so inspired by your scientific attitude.
ulist/python/ulist/ulist.pyi (Outdated)

```diff
@@ -299,7 +299,7 @@ def arange32(start: int, stop: int, step: int) -> IntegerList32: ...
 def arange64(start: int, stop: int, step: int) -> IntegerList64: ...

-def read_csv() -> list: ...
+def read_csv(path: str, schema: Sequence[Tuple[str, str]]) -> list: ...
```
To be more specific, the return type is `List[LIST_RS]`?
@yingmanwumen Thanks so much for the PR, I left some comments there. Happy Dragon Boat Festival~
Let me merge this PR and we can improve the benchmark in the future. @yingmanwumen
Hello,
`read_csv()` #151 is completed. I wrote a few simple tests in the directory `tests/test_csv`.
There is a precision issue with `float32`: for example, … would be parsed to … on my machine, and I don't know if the result would be different on other machines.
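For illustration (not from the original description, whose concrete values were lost): the snippet below shows the kind of `float32` rounding that produces such differences; numpy is used here purely for demonstration and is not a ulist dependency.

```python
# Demonstrates float32 rounding: 1.1 is not exactly representable in 32 bits,
# so parsing "1.1" as float32 and widening it back to float64 exposes the error.
import numpy as np

parsed_32 = float(np.float32("1.1"))   # value actually stored after float32 parsing
parsed_64 = float(np.float64("1.1"))   # same text parsed as float64

print(parsed_32)  # 1.100000023841858
print(parsed_64)  # 1.1
```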