Implement `read_csv` method. #151

tushushu · 2022-03-31T12:02:35Z

In order to make the function efficient enough, we'd better use Rust's library to read the csv file rather than Python. The 1st version read_csv function looks like this:

>>> import ulist as ul

>>> data = ul.read_csv(path="/some_path/example.csv", schema={"foo": "int64", "bar": "bool"})
{
    "foo" : UltraFastList([1, 2, 3]), 
    "bar" : UltraFastList([True, False, True]), 
}

In the next version, the parameter schema is optional, when it's set to None, then the read_csv function will auto detect the name and dtype of each column in the .csv file.

The text was updated successfully, but these errors were encountered:

yingmanwumen · 2022-04-20T06:29:13Z

Hello, I am interested at this issue and would you like to share your designs about this method read_csv? Thanks~

tushushu · 2022-04-20T13:32:51Z

@yingmanwumen I've updated the description of this PR, please feel free to ask questions or share your opinions here.

yingmanwumen · 2022-04-21T14:33:59Z

@tushushu
I think the main difficulty of reading CSV is that the strings are placed row by row, but CSV is organized according to columns, that means if I read the csv file row by row, I would/may inevitably need to call append many time, which may result in performance waste ( I don't konw whether this waste is acceptable)(or I can organize it by column in python in advance, but again, I don't know how efficient it is).
python has a csv library that can extract only one column ( I don't know much about python, so I have no idea about the efficient of csv library) (But it may have greater advantages in scalability and other sides).
I think one feasible way is to try to implement read_csv with csv library and without csv library at the same time and find the best scheme through benchmarks ^_^

yingmanwumen · 2022-04-21T14:43:38Z

And I will be glad to know what you think , I'm still a beginner and don't understand design and so on.Thank you for your patience : )

tushushu · 2022-04-23T07:49:51Z

@yingmanwumen My idea is to use Rust csv crate to read the .csv file row by row and append to the Rust vectors, and then create a Rust Vec[List] object, at last wrapped by a Python object. And I think you can definitely try different ways to compare the benchmarks.
The .csv file is not as efficient as .parquet file for it's organized by row, but as it is widely used by machine learning industry, implementing read_csv method is required. And the overhead caused by reading .csv is acceptable.

yingmanwumen · 2022-04-25T07:13:04Z

@yingmanwumen My idea is to use Rust csv crate to read the .csv file row by row and append to the Rust vectors, and then create a Rust Vec[List] object, at last wrapped by a Python object. And I think you can definitely try different ways to compare the benchmarks. The .csv file is not as efficient as .parquet file for it's organized by row, but as it is widely used by machine learning industry, implementing read_csv method is required. And the overhead caused by reading .csv is acceptable.

Hello, I have read the source code and found several issues here:

List in Rust is an abstract trait. It is not Object Safe, therefore, Vec[List] or Vec<Box<dyn List>> is not available:

/// Abstract List with generic type elements.
pub trait List<T>
where
    T: PartialEq + Clone,
    Self: Sized,
{
...
}

I have tried to fix (1) by creating Vec<PyObject> instead, but I couldn'd call append method directly in Rust:

// ... Some other codes
// a simple proof of my idea
fn do_read_csv(py: Python, mut reader: Reader<File>) -> PyResult<Vec<PyObject>> {
    // Try to create a `Vec<PyObject>`
    // Maybe complex to map a schema with unlimited length into Rust types,
    // such as "{ "foo": "int", "bar": "string" }" maps into `Integer64` and `StringList`
    let ulist = PyModule::import(py, "ulist.core")?
        .call_method1("UltraFastList", (StringList::new(vec![]),))?;
    let mut vec_list: Vec<PyObject> = Vec::new();
    vec_list.push(ulist.into());

    for iter in reader.records() {
        match iter {
            Err(err) => return Err(PyIOError::new_err(err.to_string())),
            Ok(records) => {
                for record in records.into_iter() {
                    // Just a simple proof to test how to call `append` method in rust
                    // I didn't found another way to directly call `append` in rust
                    // So if all in rust, things will be much more complicated
                    vec_list[0].call_method1(py, "append", (record,))?;
                }
            }
        }
    }
    Ok(vec_list)
}

So, after thinking for a long time(may be long), I think it's better to leave the IO aspect to python to reduce complexity.

Maybe there are some other simple ways that I haven't found to do this in Rust. If there is another better way, please let me know. Also, I hope to obtain your opinion.
Have a nice day!

yingmanwumen · 2022-04-25T14:49:00Z

Another issue is that not all fields in .csv will be filled, for example:

id, name, age
1001, , 20
1002, bob, 21
1003, jordan,

name field in row 1 is not filled, and age field in row 2 is not filled. And ulist doesn't seem to have a data type which represent None or NULL.
I don't know what should I do here T_T

tushushu · 2022-04-26T13:30:19Z

Another issue is that not all fields in .csv will be filled, for example:
id, name, age
1001, , 20
1002, bob, 21
1003, jordan,
name field in row 1 is not filled, and age field in row 2 is not filled. And ulist doesn't seem to have a data type which represent None or NULL. I don't know what should I do here T_T

I am working on another branch which allow ulist to hold and handle missing values. You may need to wait for this for 1 week.

tushushu · 2022-04-26T14:07:11Z

@yingmanwumen My idea is to use Rust csv crate to read the .csv file row by row and append to the Rust vectors, and then create a Rust Vec[List] object, at last wrapped by a Python object. And I think you can definitely try different ways to compare the benchmarks. The .csv file is not as efficient as .parquet file for it's organized by row, but as it is widely used by machine learning industry, implementing read_csv method is required. And the overhead caused by reading .csv is acceptable.

Hello, I have read the source code and found several issues here:

List in Rust is an abstract trait. It is not Object Safe, therefore, Vec[List] or Vec<Box<dyn List>> is not available:
/// Abstract List with generic type elements.
pub trait List<T>
where
    T: PartialEq + Clone,
    Self: Sized,
{
...
}
I have tried to fix (1) by creating Vec<PyObject> instead, but I couldn'd call append method directly in Rust:
// ... Some other codes
// a simple proof of my idea
fn do_read_csv(py: Python, mut reader: Reader<File>) -> PyResult<Vec<PyObject>> {
    // Try to create a `Vec<PyObject>`
    // Maybe complex to map a schema with unlimited length into Rust types,
    // such as "{ "foo": "int", "bar": "string" }" maps into `Integer64` and `StringList`
    let ulist = PyModule::import(py, "ulist.core")?
        .call_method1("UltraFastList", (StringList::new(vec![]),))?;
    let mut vec_list: Vec<PyObject> = Vec::new();
    vec_list.push(ulist.into());

    for iter in reader.records() {
        match iter {
            Err(err) => return Err(PyIOError::new_err(err.to_string())),
            Ok(records) => {
                for record in records.into_iter() {
                    // Just a simple proof to test how to call `append` method in rust
                    // I didn't found another way to directly call `append` in rust
                    // So if all in rust, things will be much more complicated
                    vec_list[0].call_method1(py, "append", (record,))?;
                }
            }
        }
    }
    Ok(vec_list)
}
So, after thinking for a long time(may be long), I think it's better to leave the IO aspect to python to reduce complexity.

Maybe there are some other simple ways that I haven't found to do this in Rust. If there is another better way, please let me know. Also, I hope to obtain your opinion. Have a nice day!

@yingmanwumen Thanks very much for the effort and hard work. Could you look at this issue first to see if it can solve our issue?

If not, then let's try function like this:

struct AnotherStruct{
    a: Vec<StringList>,
    b: Vec<IntList>,
    c: Vec<BoolList>,
}
fn read_csv(...) -> AnotherStruct

And let Python to merge these lists(vectors) together, which will be very fast.

If this still not work, then let's just use pure Python codes to implement read_csv method.

Thanks again and have a nice day~

yingmanwumen · 2022-04-28T03:06:13Z

Wow, thanks a lot, I will try this again!
I have already finished a pure python version yesterday, and was blocked by the test part, for the function check_test_result was not designed for a Dict type result before, and I am trying to modify it.

That's awesome, let me know if there is any problem~

yingmanwumen · 2022-04-28T03:07:40Z

I am working on another branch which allow ulist to hold and handle missing values. You may need to wait for this for 1 week.

OK, I saw that ^_^

tushushu · 2022-05-02T03:19:02Z

The branch to handle missing values has been merged into the main branch. You may need to merge the main branch into your branch first, and then fix some branch conflicts. @yingmanwumen

tushushu · 2022-05-15T02:23:34Z

@yingmanwumen I wrote a half done read_csv function which can convert different Rust objects into PyObjects. The PR is #170
You may merge the main branch to your working branch and continue your work on read_csv function.

tushushu changed the title ~~Implement from_csv method.~~ Implement read_csv method. Mar 31, 2022

tushushu added the io Input-output label Mar 31, 2022

tushushu added this to the ulist 0.10.0 milestone Mar 31, 2022

tushushu assigned yingmanwumen Apr 20, 2022

tushushu modified the milestones: ulist 0.10.0, ulist 0.11.0 May 4, 2022

yingmanwumen mentioned this issue May 27, 2022

feat(io): complete read_csv #183

Merged

tushushu linked a pull request May 28, 2022 that will close this issue

feat(io): complete read_csv #183

Merged

tushushu modified the milestones: ulist 0.11.0, ulist 0.12.0 Jun 25, 2022

tushushu closed this as completed in #183 Aug 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `read_csv` method. #151

Implement `read_csv` method. #151

tushushu commented Mar 31, 2022 •

edited

Loading

yingmanwumen commented Apr 20, 2022

tushushu commented Apr 20, 2022

yingmanwumen commented Apr 21, 2022

yingmanwumen commented Apr 21, 2022

tushushu commented Apr 23, 2022 •

edited

Loading

yingmanwumen commented Apr 25, 2022

yingmanwumen commented Apr 25, 2022

tushushu commented Apr 26, 2022 •

edited

Loading

tushushu commented Apr 26, 2022

yingmanwumen commented Apr 28, 2022 •

edited by tushushu

Loading

yingmanwumen commented Apr 28, 2022

tushushu commented May 2, 2022

tushushu commented May 15, 2022

Implement read_csv method. #151

Implement read_csv method. #151

Comments

tushushu commented Mar 31, 2022 • edited Loading

yingmanwumen commented Apr 20, 2022

tushushu commented Apr 20, 2022

yingmanwumen commented Apr 21, 2022

yingmanwumen commented Apr 21, 2022

tushushu commented Apr 23, 2022 • edited Loading

yingmanwumen commented Apr 25, 2022

yingmanwumen commented Apr 25, 2022

tushushu commented Apr 26, 2022 • edited Loading

tushushu commented Apr 26, 2022

yingmanwumen commented Apr 28, 2022 • edited by tushushu Loading

yingmanwumen commented Apr 28, 2022

tushushu commented May 2, 2022

tushushu commented May 15, 2022

Implement `read_csv` method. #151

Implement `read_csv` method. #151

tushushu commented Mar 31, 2022 •

edited

Loading

tushushu commented Apr 23, 2022 •

edited

Loading

tushushu commented Apr 26, 2022 •

edited

Loading

yingmanwumen commented Apr 28, 2022 •

edited by tushushu

Loading