
feat(io): complete read_csv #183

Merged
merged 4 commits into from
Aug 20, 2022

Conversation

yingmanwumen
Contributor

Hello,
read_csv() #151 is complete. I wrote a few simple tests in the directory tests/test_csv.
There is a precision issue about float32, for example,

float32, float64
3.14159, 3.14159

would be parsed to

{
    "float32": 3.141590118408203,
    "float64": 3.14159
}

on my machine, and I don't know whether the result would differ on other machines.
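The rounding is inherent to IEEE-754 single precision, so it should be identical on any machine with standard floats. It can be reproduced in pure Python with a stdlib-only sketch:

```python
import struct

# Round-trip 3.14159 through IEEE-754 single precision ("f" packs a
# 4-byte float) to reproduce the value reported above: the nearest
# float32 to 3.14159, widened back to a Python float (float64).
f32 = struct.unpack("f", struct.pack("f", 3.14159))[0]
print(f32)  # 3.141590118408203
```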

@tushushu
Collaborator

Thanks @yingmanwumen , I may need more time to review this PR.

@tushushu
Collaborator

tushushu commented May 28, 2022

There is a precision issue about float32, for example,

@yingmanwumen I think the way we are comparing floating-point numbers in Python is not good enough. The check_test_result function compares x and y using the == operator, but sometimes that won't work well. For example, 0.199 + 0.101 is 0.30000000000000004. Could you help improve the check_test_result function by using math.isclose to compare the floating-point numbers? That's the recommended way in Python.
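A hypothetical sketch of what such a helper could look like (the real check_test_result lives in this repo's tests; names and tolerances here are assumptions):

```python
import math

def check_test_result(expected, actual):
    # Hypothetical sketch: compare floats with math.isclose instead
    # of ==, and fall back to plain equality for everything else.
    assert len(expected) == len(actual)
    for x, y in zip(expected, actual):
        if isinstance(x, float) and isinstance(y, float):
            # rel_tol covers ordinary rounding noise such as
            # 0.199 + 0.101 vs 0.3; abs_tol guards values near zero.
            assert math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12), (x, y)
        else:
            assert x == y, (x, y)

check_test_result([0.3, 1, "a"], [0.199 + 0.101, 1, "a"])  # == would fail on the floats
```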

@tushushu tushushu self-requested a review May 28, 2022 13:21
@tushushu tushushu added the io Input-output label May 28, 2022
@tushushu tushushu added this to the ulist 0.11.0 milestone May 28, 2022
@tushushu tushushu linked an issue May 28, 2022 that may be closed by this pull request
tests/test_io.py Outdated
})
],
)
def test_constructors(
Collaborator

Perhaps another name such as test_inputs?

tests/test_io.py Outdated
@pytest.mark.parametrize(
"test_method, args, kwargs, expected_value",
[
(ul.read_csv, (), {
Collaborator

I am wondering whether the error messages would be raised as expected if the schema dtype does not match the actual dtype. So shall we add some more tests like this?
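A hypothetical sketch of the pattern such a test could follow; the real test would call ul.read_csv on a csv with a mismatched schema, so a stdlib stand-in (int() on a non-numeric cell) is used here to keep the sketch self-contained:

```python
import pytest

def test_dtype_mismatch():
    # Hypothetical: a non-numeric cell parsed under an "int" schema
    # should surface as an exception rather than silent coercion.
    with pytest.raises(ValueError):
        int("not-a-number")  # stand-in for the mismatched-column parse

test_dtype_mismatch()
```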

Collaborator

Other cases could be:

  • The schema.len() is greater than the number of columns in the .csv file;
  • The schema.len() is less than the number of columns in the .csv file;
  • The column name does not match.

Contributor Author

Other cases could be:

  • The schema.len() is greater than the number of columns in the .csv file;
  • The schema.len() is less than the number of columns in the .csv file;
  • The column name does not match.

Hello,
the strategy I chose is:

  1. If a field in the schema doesn't exist in the .csv file, then ulist will return an empty list for it, such as {..., "bar": [], ...}
  2. If a field in the .csv file doesn't exist in the schema, then ulist will ignore it

So currently, the cases above cannot cause an exception.
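That strategy can be illustrated with a plain-Python stand-in (a hypothetical helper, not ulist's actual implementation; type parsing is omitted):

```python
import csv
import io

def read_csv_subset(text, schema):
    # Sketch of the strategy described above: schema fields missing
    # from the file yield empty lists; file columns absent from the
    # schema are simply ignored.
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        field: [row[field] for row in rows] if rows and field in rows[0] else []
        for field in schema
    }

data = "foo,extra\n1,x\n2,y\n"
print(read_csv_subset(data, ["foo", "bar"]))
# {'foo': ['1', '2'], 'bar': []}   ("extra" is ignored, "bar" is empty)
```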

Collaborator

For 2., I agree that the result of read_csv could be a subset of the content of the .csv file, so could you also add another test case where the schema has fewer columns than the csv file?

Collaborator

For 1., let's do some research on mainstream data analysis, data processing libs or even databases to see what kind of behavior would be better.

  1. Pandas, the most popular DataFrame lib: the read_csv function will always read all the columns from the csv file, no matter how many column names are in the schema. See the Doc
    Suppose the tmp.csv file contains only columns a and b
>>> import pandas as pd
>>> pd.read_csv("tmp.csv")
   a   b
0  1   2
>>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int})
   a   b
0  1   2
>>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int, 'c': int})
   a   b
0  1   2
>>> pd.read_csv("tmp.csv", dtype={'a': int, 'b': int, 'c': int, 'd': int})
   a   b
0  1   2
>>> pd.read_csv("tmp.csv", dtype={'c': int, 'd': int})
   a   b
0  1   2
  2. SQL, the most used language for databases: it does a very strict check of column names and will raise an error if a name does not exist in the database. For example, if there is a table named tmp which contains columns a and b, the following script will just not work.
SELECT CAST(tmp.c AS INT) as c  from tmp  

If you don't mind, would you help test the behavior of pyarrow, which is also one of the best data processing libs?

Thanks so much~

Collaborator

tushushu commented May 29, 2022

And please feel free to investigate any other popular projects and share your ideas here. The reason why we have to be cautious is that in real industry usage, the number of columns could be 10, 20, 100 or even thousands, and it's very common for developers to accidentally type incorrect column names in the schema. So whether to simply ignore the typo, raise a warning, or raise an error is an important topic to discuss.
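One hypothetical middle ground between pandas (silently ignore) and SQL (hard error) would be to warn on schema columns that the file lacks; a sketch, with names that are assumptions rather than ulist API:

```python
import warnings

def validate_schema(schema_cols, file_cols):
    # Hypothetical policy sketch: warn on schema columns the csv file
    # does not contain, then proceed with the columns that do exist.
    missing = [c for c in schema_cols if c not in file_cols]
    if missing:
        warnings.warn(f"schema columns not found in csv: {missing}")
    return [c for c in schema_cols if c in file_cols]

print(validate_schema(["a", "b", "c"], ["a", "b"]))  # ['a', 'b'] plus a UserWarning
```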

Contributor Author

For 2., I agree that the result of read_csv could be a subset of the content of the .csv file, so could you also add another test case where the schema has fewer columns than the csv file?

Sure, the test case is appended in the newest commit b86fc43:

...
            (ul.read_csv, (), {  # schema.len() < field.len()
                "path": "./test_csv/04_test_nan.csv",
                "schema": {"int": "int",
                           "bool": "bool"}
            }, {
                "int": [None, 2, 3, 4],
                "bool": [True, False, True, None]
            }),
...

Contributor Author

For 1., let's do some research on mainstream data analysis, data processing libs or even databases to see what kind of behavior would be better. […]

If you don't mind, would you help test the behavior of pyarrow, which is also one of the best data processing libs?

Thanks so much~

^_^ I am glad to do this, but maybe a little later. Honestly, I never thought about this before. I am so inspired by your scientific attitude.

@@ -299,7 +299,7 @@ def arange32(start: int, stop: int, step: int) -> IntegerList32: ...
def arange64(start: int, stop: int, step: int) -> IntegerList64: ...


def read_csv() -> list: ...
def read_csv(path: str, schema: Sequence[Tuple[str, str]]) -> list: ...
Collaborator

To be more specific, the return type is List[LIST_RS]?

Collaborator

tushushu left a comment

@yingmanwumen Thanks so much for the PR, I left some comments there. Happy Dragon Boat Festival~

@tushushu tushushu modified the milestones: ulist 0.11.0, ulist 0.12.0 Jun 25, 2022
@tushushu
Collaborator

Let me merge this PR and we can improve the benchmark in the future. @yingmanwumen

@tushushu tushushu merged commit 5aa06f7 into Rust-Data-Science:main Aug 20, 2022
Labels
io Input-output
Development

Successfully merging this pull request may close these issues.

Implement read_csv method.
2 participants