
feat(io): complete read_csv #183

Merged
merged 4 commits into from
Aug 20, 2022

Conversation

yingmanwumen
Contributor

Hello,
read_csv() #151 is complete. I wrote a few simple tests in the directory tests/test_csv.
There is a precision issue about float32, for example,

float32, float64
3.14159, 3.14159

would be parsed to

{
    "float32": 3.141590118408203,
    "float64": 3.14159
}

on my machine, and I don't know whether the result would differ on other machines.
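The rounding is inherent to IEEE-754 single precision, so it should be identical on any machine with standard floats. It can be reproduced in pure Python with a stdlib-only sketch:

```python
import struct

# Round-trip 3.14159 through IEEE-754 single precision ("f" packs a
# 4-byte float) to reproduce the value reported above: the nearest
# float32 to 3.14159, widened back to a Python float (float64).
f32 = struct.unpack("f", struct.pack("f", 3.14159))[0]
print(f32)  # 3.141590118408203
```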

@tushushu
Collaborator

Thanks @yingmanwumen , I may need more time to review this PR.

@tushushu
Collaborator

tushushu commented May 28, 2022

There is a precision issue about float32, for example,

@yingmanwumen I think the way we are comparing floating-point numbers in Python is not good enough. The check_test_result function compares x and y using the == operator, but sometimes that won't work well. For example, 0.199 + 0.101 is 0.30000000000000004. Could you help improve the check_test_result function by using math.isclose to compare the floating-point numbers? That's the recommended way in Python.
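A hypothetical sketch of what such a helper could look like (the real check_test_result lives in this repo's tests; names and tolerances here are assumptions):

```python
import math

def check_test_result(expected, actual):
    # Hypothetical sketch: compare floats with math.isclose instead
    # of ==, and fall back to plain equality for everything else.
    assert len(expected) == len(actual)
    for x, y in zip(expected, actual):
        if isinstance(x, float) and isinstance(y, float):
            # rel_tol covers ordinary rounding noise such as
            # 0.199 + 0.101 vs 0.3; abs_tol guards values near zero.
            assert math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12), (x, y)
        else:
            assert x == y, (x, y)

check_test_result([0.3, 1, "a"], [0.199 + 0.101, 1, "a"])  # == would fail on the floats
```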

@tushushu tushushu self-requested a review May 28, 2022 13:21
@tushushu tushushu added the io Input-output label May 28, 2022
@tushushu tushushu added this to the ulist 0.11.0 milestone May 28, 2022
@tushushu tushushu linked an issue May 28, 2022 that may be closed by this pull request
tests/test_io.py Outdated
})
],
)
def test_constructors(
Collaborator

Perhaps another name such as test_inputs?

tests/test_io.py Outdated
@pytest.mark.parametrize(
"test_method, args, kwargs, expected_value",
[
(ul.read_csv, (), {
Collaborator

I am wondering whether the error messages would be raised as expected if the schema dtype does not match the actual dtype. So shall we add some more tests like this?
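A hypothetical sketch of the pattern such a test could follow; the real test would call ul.read_csv on a csv with a mismatched schema, so a stdlib stand-in (int() on a non-numeric cell) is used here to keep the sketch self-contained:

```python
import pytest

def test_dtype_mismatch():
    # Hypothetical: a non-numeric cell parsed under an "int" schema
    # should surface as an exception rather than silent coercion.
    with pytest.raises(ValueError):
        int("not-a-number")  # stand-in for the mismatched-column parse

test_dtype_mismatch()
```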

Collaborator

Other cases could be:

  • The schema.len() is greater than the number of columns in the .csv file;
  • The schema.len() is less than the number of columns in the .csv file;
  • The column name does not match.

Contributor Author

Other cases could be:

  • The schema.len() is greater than the number of columns in the .csv file;
  • The schema.len() is less than the number of columns in the .csv file;
  • The column name does not match.

Hello,
the strategy I chose is:

  1. If a field in the schema doesn't exist in the .csv file, then ulist will return an empty list for it, such as {..., "bar": [], ...}
  2. If a field in the .csv file doesn't exist in the schema, then ulist will ignore it

So currently, the cases above cannot cause an exception.
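That strategy can be illustrated with a plain-Python stand-in (a hypothetical helper, not ulist's actual implementation; type parsing is omitted):

```python
import csv
import io

def read_csv_subset(text, schema):
    # Sketch of the strategy described above: schema fields missing
    # from the file yield empty lists; file columns absent from the
    # schema are simply ignored.
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        field: [row[field] for row in rows] if rows and field in rows[0] else []
        for field in schema
    }

data = "foo,extra\n1,x\n2,y\n"
print(read_csv_subset(data, ["foo", "bar"]))
# {'foo': ['1', '2'], 'bar': []}   ("extra" is ignored, "bar" is empty)
```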

Collaborator

For 2., I agree that the result of read_csv could be a subset of the content of the .csv file, so could you also add another test case where the schema has fewer columns than the csv file?

Collaborator

For 1., let's do some research on mainstream data analysis, data processing libs or even databases to see what kind of behavior would be better.

  1. Pandas, the most popular DataFrame lib: the read_csv function will always read all the columns from the csv file, no matter how many column names are in the schema. See the Doc
    Suppose the tmp.csv file contains only columns a and b
>>> import pandas as pd
>>> pd.read_csv("tmp.csv")
   a   b
0  1   2
>>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int})
   a   b
0  1   2
>>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int, 'c': int})
   a   b
0  1   2
>>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int, 'c': int, 'd': int})
   a   b
0  1   2
>>> pd.read_csv("tmp.csv", dtype={'c': int, 'd': int})
   a   b
0  1   2
  2. SQL, the most used language for databases: it does a very strict check of column names and will raise an error if a name does not exist in the database. For example, if there is a table named tmp which contains columns a and b, the following script will just not work.
SELECT CAST(tmp.c AS INT) as c  from tmp  

If you don't mind, would you help test the behavior of pyarrow, which is also one of the best data processing libs?

Thanks so much~

Collaborator

tushushu commented May 29, 2022

And please feel free to investigate any other popular projects and share your ideas here. The reason why we have to be cautious is that in real industry usage, the number of columns could be 10, 20, 100 or even thousands, and it's very common for developers to accidentally type incorrect column names in the schema. So whether to simply ignore the typo, raise a warning, or raise an error is an important topic to discuss.
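One hypothetical middle ground between pandas (silently ignore) and SQL (hard error) would be to warn on schema columns that the file lacks; a sketch, with names that are assumptions rather than ulist API:

```python
import warnings

def validate_schema(schema_cols, file_cols):
    # Hypothetical policy sketch: warn on schema columns the csv file
    # does not contain, then proceed with the columns that do exist.
    missing = [c for c in schema_cols if c not in file_cols]
    if missing:
        warnings.warn(f"schema columns not found in csv: {missing}")
    return [c for c in schema_cols if c in file_cols]

print(validate_schema(["a", "b", "c"], ["a", "b"]))  # ['a', 'b'] plus a UserWarning
```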

Contributor Author

For 2., I agree that the result of read_csv could be a subset of the content of the .csv file, so could you also add another test case where the schema has fewer columns than the csv file?

Sure, the test case is appended in the newest commit b86fc43:

...
            (ul.read_csv, (), {  # schema.len() < field.len()
                "path": "./test_csv/04_test_nan.csv",
                "schema": {"int": "int",
                           "bool": "bool"}
            }, {
                "int": [None, 2, 3, 4],
                "bool": [True, False, True, None]
            }),
...

Contributor Author

For 1., let's do some research on mainstream data analysis, data processing libs or even databases to see what kind of behavior would be better. […]

If you don't mind, would you help test the behavior of pyarrow, which is also one of the best data processing libs?

Thanks so much~

^_^ I am glad to do this, but maybe a little later. Honestly, I never thought about this before. I am so inspired by your scientific attitude.

@@ -299,7 +299,7 @@ def arange32(start: int, stop: int, step: int) -> IntegerList32: ...
def arange64(start: int, stop: int, step: int) -> IntegerList64: ...


def read_csv() -> list: ...
def read_csv(path: str, schema: Sequence[Tuple[str, str]]) -> list: ...
Collaborator

To be more specific, the return type is List[LIST_RS]?

Collaborator

tushushu left a comment

@yingmanwumen Thanks so much for the PR, I left some comments there. Happy Dragon Boat Festival~

@tushushu tushushu modified the milestones: ulist 0.11.0, ulist 0.12.0 Jun 25, 2022
@tushushu
Collaborator

Let me merge this PR and we can improve the benchmark in the future. @yingmanwumen

@tushushu tushushu merged commit 5aa06f7 into Rust-Data-Science:main Aug 20, 2022
Labels
io Input-output
Development

Successfully merging this pull request may close these issues.

Implement read_csv method.
2 participants