Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement read_csv method. #151

Closed
tushushu opened this issue Mar 31, 2022 · 13 comments · Fixed by #183
Closed

Implement read_csv method. #151

tushushu opened this issue Mar 31, 2022 · 13 comments · Fixed by #183
Assignees
Labels
io Input-output
Milestone

Comments

@tushushu
Copy link
Collaborator

tushushu commented Mar 31, 2022

In order to make the function efficient enough, we'd better use Rust's library to read the csv file rather than Python. The 1st version read_csv function looks like this:

>>> import ulist as ul

>>> data = ul.read_csv(path="/some_path/example.csv", schema={"foo": "int64", "bar": "bool"})
{
    "foo" : UltraFastList([1, 2, 3]), 
    "bar" : UltraFastList([True, False, True]), 
}

In the next version, the parameter schema is optional, when it's set to None, then the read_csv function will auto detect the name and dtype of each column in the .csv file.

@tushushu tushushu changed the title Implement from_csv method. Implement read_csv method. Mar 31, 2022
@tushushu tushushu added the io Input-output label Mar 31, 2022
@tushushu tushushu added this to the ulist 0.10.0 milestone Mar 31, 2022
@yingmanwumen
Copy link
Contributor

Hello, I am interested at this issue and would you like to share your designs about this method read_csv? Thanks~

@tushushu
Copy link
Collaborator Author

@yingmanwumen I've updated the description of this PR, please feel free to ask questions or share your opinions here.

@yingmanwumen
Copy link
Contributor

@tushushu
I think the main difficulty of reading CSV is that the strings are placed row by row, but CSV is organized according to columns, that means if I read the csv file row by row, I would/may inevitably need to call append many time, which may result in performance waste ( I don't konw whether this waste is acceptable)(or I can organize it by column in python in advance, but again, I don't know how efficient it is).
python has a csv library that can extract only one column ( I don't know much about python, so I have no idea about the efficient of csv library) (But it may have greater advantages in scalability and other sides).
I think one feasible way is to try to implement read_csv with csv library and without csv library at the same time and find the best scheme through benchmarks ^_^

@yingmanwumen
Copy link
Contributor

And I will be glad to know what you think , I'm still a beginner and don't understand design and so on.Thank you for your patience : )

@tushushu
Copy link
Collaborator Author

tushushu commented Apr 23, 2022

@yingmanwumen My idea is to use Rust csv crate to read the .csv file row by row and append to the Rust vectors, and then create a Rust Vec[List] object, at last wrapped by a Python object. And I think you can definitely try different ways to compare the benchmarks.
The .csv file is not as efficient as .parquet file for it's organized by row, but as it is widely used by machine learning industry, implementing read_csv method is required. And the overhead caused by reading .csv is acceptable.

@yingmanwumen
Copy link
Contributor

@yingmanwumen My idea is to use Rust csv crate to read the .csv file row by row and append to the Rust vectors, and then create a Rust Vec[List] object, at last wrapped by a Python object. And I think you can definitely try different ways to compare the benchmarks. The .csv file is not as efficient as .parquet file for it's organized by row, but as it is widely used by machine learning industry, implementing read_csv method is required. And the overhead caused by reading .csv is acceptable.

Hello, I have read the source code and found several issues here:

  1. List in Rust is an abstract trait. It is not Object Safe, therefore, Vec[List] or Vec<Box<dyn List>> is not available:
/// Abstract List with generic type elements.
pub trait List<T>
where
    T: PartialEq + Clone,
    Self: Sized,
{
...
}
  1. I have tried to fix (1) by creating Vec<PyObject> instead, but I couldn'd call append method directly in Rust:
// ... Some other codes
// a simple proof of my idea
fn do_read_csv(py: Python, mut reader: Reader<File>) -> PyResult<Vec<PyObject>> {
    // Try to create a `Vec<PyObject>`
    // Maybe complex to map a schema with unlimited length into Rust types,
    // such as "{ "foo": "int", "bar": "string" }" maps into `Integer64` and `StringList`
    let ulist = PyModule::import(py, "ulist.core")?
        .call_method1("UltraFastList", (StringList::new(vec![]),))?;
    let mut vec_list: Vec<PyObject> = Vec::new();
    vec_list.push(ulist.into());

    for iter in reader.records() {
        match iter {
            Err(err) => return Err(PyIOError::new_err(err.to_string())),
            Ok(records) => {
                for record in records.into_iter() {
                    // Just a simple proof to test how to call `append` method in rust
                    // I didn't found another way to directly call `append` in rust
                    // So if all in rust, things will be much more complicated
                    vec_list[0].call_method1(py, "append", (record,))?;
                }
            }
        }
    }
    Ok(vec_list)
}

So, after thinking for a long time(may be long), I think it's better to leave the IO aspect to python to reduce complexity.

Maybe there are some other simple ways that I haven't found to do this in Rust. If there is another better way, please let me know. Also, I hope to obtain your opinion.
Have a nice day!

@yingmanwumen
Copy link
Contributor

Another issue is that not all fields in .csv will be filled, for example:

id, name, age
1001, , 20
1002, bob, 21
1003, jordan,

name field in row 1 is not filled, and age field in row 2 is not filled. And ulist doesn't seem to have a data type which represent None or NULL.
I don't know what should I do here T_T

@tushushu
Copy link
Collaborator Author

tushushu commented Apr 26, 2022

Another issue is that not all fields in .csv will be filled, for example:

id, name, age
1001, , 20
1002, bob, 21
1003, jordan,

name field in row 1 is not filled, and age field in row 2 is not filled. And ulist doesn't seem to have a data type which represent None or NULL. I don't know what should I do here T_T

I am working on another branch which allow ulist to hold and handle missing values. You may need to wait for this for 1 week.

@tushushu
Copy link
Collaborator Author

@yingmanwumen My idea is to use Rust csv crate to read the .csv file row by row and append to the Rust vectors, and then create a Rust Vec[List] object, at last wrapped by a Python object. And I think you can definitely try different ways to compare the benchmarks. The .csv file is not as efficient as .parquet file for it's organized by row, but as it is widely used by machine learning industry, implementing read_csv method is required. And the overhead caused by reading .csv is acceptable.

Hello, I have read the source code and found several issues here:

  1. List in Rust is an abstract trait. It is not Object Safe, therefore, Vec[List] or Vec<Box<dyn List>> is not available:
/// Abstract List with generic type elements.
pub trait List<T>
where
    T: PartialEq + Clone,
    Self: Sized,
{
...
}
  1. I have tried to fix (1) by creating Vec<PyObject> instead, but I couldn'd call append method directly in Rust:
// ... Some other codes
// a simple proof of my idea
fn do_read_csv(py: Python, mut reader: Reader<File>) -> PyResult<Vec<PyObject>> {
    // Try to create a `Vec<PyObject>`
    // Maybe complex to map a schema with unlimited length into Rust types,
    // such as "{ "foo": "int", "bar": "string" }" maps into `Integer64` and `StringList`
    let ulist = PyModule::import(py, "ulist.core")?
        .call_method1("UltraFastList", (StringList::new(vec![]),))?;
    let mut vec_list: Vec<PyObject> = Vec::new();
    vec_list.push(ulist.into());

    for iter in reader.records() {
        match iter {
            Err(err) => return Err(PyIOError::new_err(err.to_string())),
            Ok(records) => {
                for record in records.into_iter() {
                    // Just a simple proof to test how to call `append` method in rust
                    // I didn't found another way to directly call `append` in rust
                    // So if all in rust, things will be much more complicated
                    vec_list[0].call_method1(py, "append", (record,))?;
                }
            }
        }
    }
    Ok(vec_list)
}

So, after thinking for a long time(may be long), I think it's better to leave the IO aspect to python to reduce complexity.

Maybe there are some other simple ways that I haven't found to do this in Rust. If there is another better way, please let me know. Also, I hope to obtain your opinion. Have a nice day!

@yingmanwumen Thanks very much for the effort and hard work. Could you look at this issue first to see if it can solve our issue?

If not, then let's try function like this:

struct AnotherStruct{
    a: Vec<StringList>,
    b: Vec<IntList>,
    c: Vec<BoolList>,
}
fn read_csv(...) -> AnotherStruct

And let Python to merge these lists(vectors) together, which will be very fast.

If this still not work, then let's just use pure Python codes to implement read_csv method.

Thanks again and have a nice day~

@yingmanwumen
Copy link
Contributor

yingmanwumen commented Apr 28, 2022

Wow, thanks a lot, I will try this again!
I have already finished a pure python version yesterday, and was blocked by the test part, for the function check_test_result was not designed for a Dict type result before, and I am trying to modify it.

That's awesome, let me know if there is any problem~

@yingmanwumen
Copy link
Contributor

I am working on another branch which allow ulist to hold and handle missing values. You may need to wait for this for 1 week.

OK, I saw that ^_^

@tushushu
Copy link
Collaborator Author

tushushu commented May 2, 2022

The branch to handle missing values has been merged into the main branch. You may need to merge the main branch into your branch first, and then fix some branch conflicts. @yingmanwumen

@tushushu tushushu modified the milestones: ulist 0.10.0, ulist 0.11.0 May 4, 2022
@tushushu
Copy link
Collaborator Author

@yingmanwumen I wrote a half done read_csv function which can convert different Rust objects into PyObjects. The PR is #170
You may merge the main branch to your working branch and continue your work on read_csv function.

@tushushu tushushu linked a pull request May 28, 2022 that will close this issue
@tushushu tushushu modified the milestones: ulist 0.11.0, ulist 0.12.0 Jun 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Input-output
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants