csr_matrices #122

kwchurch · 2022-06-28T16:45:25Z

I have a large csr_matrix in npz format. I'd like to use it as input as is, but it doesn't have an IDs field.

I added this to graph.py (but it doesn't work):

if 'IDs' in raw:
    self.set_node_ids(raw["IDs"].tolist())
else:
    # added by kwc
    self.set_node_ids(np.arange(raw["shape"][0]).tolist())

Created edg2npz.py with this:

import sys

import numpy as np
import scipy.sparse

# Usage: python edg2npz.py OUT.npz {bool|int} < edges.edg
dtype = bool
if sys.argv[2] == "int":
    dtype = int

X = []
Y = []

# Each input line holds at least two whitespace-separated integer node indices.
for line in sys.stdin:
    fields = line.rstrip().split()
    if len(fields) >= 2:
        x, y = fields[0:2]
        X.append(int(x))
        Y.append(int(y))

X = np.array(X, dtype=np.int32)
Y = np.array(Y, dtype=np.int32)
N = 1 + max(np.max(X), np.max(Y))
V = np.ones(len(X), dtype=bool)

# Square N x N adjacency matrix with one stored entry per input edge.
M = scipy.sparse.csr_matrix((V, (X, Y)), dtype=dtype, shape=(N, N))

scipy.sparse.save_npz(sys.argv[1], M)

called it with

python edg2npz.py demo/karate.bool.npz bool < demo/karate.edg
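
For readers following along, here is a minimal sketch of how one might inspect what scipy.sparse.save_npz actually writes. It uses a tiny three-node stand-in graph and an in-memory buffer rather than the karate file, so the graph and variable names are assumptions made for illustration. The point is that the archive carries the data, indices, indptr, shape, and format arrays, with no IDs entry, and that the array dtypes follow the matrix that was saved.

```python
import io

import numpy as np
import scipy.sparse

# Tiny stand-in graph (an assumption for illustration): edges 0-1 and 1-2,
# stored in both directions.
rows = np.array([0, 1, 1, 2], dtype=np.int32)
cols = np.array([1, 0, 2, 1], dtype=np.int32)
vals = np.ones(len(rows), dtype=bool)
M = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

# Save to an in-memory buffer instead of a file, then reload the raw arrays.
buf = io.BytesIO()
scipy.sparse.save_npz(buf, M)
buf.seek(0)
raw = np.load(buf)

# List the stored keys and the dtypes of the three CSR arrays.
keys = sorted(raw.files)
dtypes = {name: raw[name].dtype for name in ("data", "indices", "indptr")}
print(keys)
print(dtypes)
```

Running the same inspection on a real file (np.load on the .npz path) is a quick way to see which dtypes a loader will receive.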

Unfortunately, I can't use this kind of csr_matrix...

I can write out my matrix to text and then run pecanpy on that, but my matrix is very large and it will take a long time to write it out and read it back. My matrix has N = 300M nodes and E=2B nonzero edges.

pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF
init pecanpy: p = 1, q = 1, workers = 1, verbose = False, extend = False, gamma = 0, random_state = None
WARNING: when p = 1 and q = 1 with unweighted graph, highly recommend using the FirstOrderUnweighted over SparseOTF. The runtime could be improved greatly with improved  memory usage.
Took 00:00:00.02 to load Graph
Took 00:00:00.00 to pre-compute transition probabilities
Traceback (most recent call last):
  File "/home/k.church/venv/gft/bin/pecanpy", line 8, in <module>
    sys.exit(main())
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 333, in main
    walks = simulate_walks(args, g)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/wrappers.py", line 18, in wrapper
    result = func(*args, **kwargs)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 320, in simulate_walks
    return g.simulate_walks(args.num_walks, args.walk_length)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/pecanpy.py", line 153, in simulate_walks
    walk_idx_mat = self._random_walks(
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 468, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function itruediv>) found for signature:
                                                                                                                                                                                                                                                    
 >>> itruediv(array(bool, 1d, C), Literal[int](1))

There are 6 candidate implementations:

Of which 2 did not match due to:
Overload in function 'NumpyRulesInplaceArrayOperator.generic': File: numba/core/typing/npydecl.py: Line 244.
With argument(s): '(array(bool, 1d, C), int64)':
Rejected as the implementation raised a specific error:
AttributeError: 'NoneType' object has no attribute 'args'
raised from /home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/typing/npydecl.py:255
Of which 2 did not match due to:
Operator Overload in function 'itruediv': File: unknown: Line unknown.
With argument(s): '(array(bool, 1d, C), int64)':
No match for registered cases:
- (int64, int64) -> float64
- (int64, uint64) -> float64
- (uint64, int64) -> float64
- (uint64, uint64) -> float64
- (float32, float32) -> float32
- (float64, float64) -> float64
- (complex64, complex64) -> complex64
- (complex128, complex128) -> complex128
Of which 2 did not match due to:
Overload of function 'itruediv': File: numba/core/typing/npdatetime.py: Line 94.
With argument(s): '(array(bool, 1d, C), int64)':
No match.
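
The signature in the error, itruediv(array(bool, 1d, C), Literal[int](1)), is an in-place true division on a boolean array. As a rough illustration outside of Numba, plain NumPy also refuses the analogous operation, because the floating-point result cannot be written back into a bool array. The sketch below is only an analogy under that assumption, not a reproduction of PecanPy's code path.

```python
import numpy as np

data = np.ones(4, dtype=bool)

# In-place true division produces float64, which NumPy will not cast back
# into a bool array under its default 'same_kind' casting rule.
try:
    data /= 1
    failed = False
except TypeError as err:
    failed = True
    message = str(err)

# Casting to a floating-point dtype first makes the in-place division legal.
as_float = data.astype(np.float32)
as_float /= 2
```

This is consistent with the later diagnosis in the thread that the data array needs a floating-point dtype.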

RemyLau · 2022-06-29T11:03:59Z

Hi @kwchurch, thank you for the detailed dev log! I slightly edited the formatting to improve readability. At first glance, it looks like a dtype incompatibility. More specifically, the CSR used by PecanPy uses uint32 for both the indices and indptr fields, rather than int32 as used by scipy.sparse.csr. Similarly, PecanPy uses float32 instead of float64 for the data field in the CSR object.

I think the most straightforward way to resolve the type issue is to enforce the desired types (i.e., float32 for data; uint32 for indices and indptr) at loading time:

PecanPy/src/pecanpy/graph.py

Lines 432 to 438 in 49d6063

    
    self.data = raw["data"]
    if self.data is None:
        raise ValueError("Adjacency matrix data not found.")
    elif not weighted:
        self.data[:] = 1.0  # overwrite edge weights with constant
    self.indptr = raw["indptr"]
    self.indices = raw["indices"]

I will first try to reproduce the error here using the example script you provided, and then see if my proposed solution actually fixes the issue.

As we also discussed, I will add the option for implicitly assigning node IDs if it is not found in the .csr.npz file. I will make it so that it requires a "soft confirmation" from the user that the implicit assignment is desired by printing a warning message about the implicit assignment, unless a specific flag (e.g., --implicit_node_ids) is set.
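
The "soft confirmation" behaviour described above could look roughly like the following sketch. The helper name resolve_node_ids and its implicit_ids parameter are hypothetical and are not taken from PecanPy's code; the only assumptions about the input are that it behaves like the mapping returned by np.load and that it has a "shape" entry, as files written by scipy.sparse.save_npz do.

```python
import warnings

import numpy as np


def resolve_node_ids(raw, implicit_ids=False):
    """Hypothetical helper: return node IDs for a loaded .npz mapping.

    If the mapping has an "IDs" entry, use it. Otherwise fall back to
    0..N-1, warning unless the caller opted in via implicit_ids=True.
    """
    if "IDs" in raw:
        return raw["IDs"].tolist()
    if not implicit_ids:
        warnings.warn(
            "No 'IDs' field found; assigning node IDs 0..N-1 implicitly.",
            stacklevel=2,
        )
    n_nodes = int(raw["shape"][0])
    return list(range(n_nodes))


# Stand-in for a scipy-style archive without IDs (illustrative data only).
raw_without_ids = {"shape": np.array([3, 3])}
ids = resolve_node_ids(raw_without_ids, implicit_ids=True)
```

Passing implicit_ids=True plays the role of the command-line flag: the fallback is the same either way, and only the warning is suppressed.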

RemyLau · 2022-06-29T12:48:13Z

Hi @kwchurch, I've created a new branch (see #124) implementing my suggestions above (explicit dtype setting and implicit node IDs setting). The scipy csr karate test case works fine on my end.

I will do more testing and write a unit test for this later today or tomorrow.

In the meantime, if you would like to give the new changes a try and let me know if this resolves your issue, that would be great. You can run it as before using

pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF

which will warn you about the implicit node IDs setting. To suppress that, you can set the --implicit_ids flag:

pecanpy --input demo/karate.bool.npz --output demo/karate.int.emb --mode SparseOTF --implicit_ids

kwchurch · 2022-06-29T14:40:04Z

ok do you think it could check the datatypes and make the necessary conversions automatically?

RemyLau · 2022-06-29T14:48:19Z

@kwchurch yes it is doing that now

PecanPy/src/pecanpy/graph.py

Lines 443 to 445 in a12f27c

    
    self.indptr = raw["indptr"].astype(np.uint32)
    self.indices = raw["indices"].astype(np.uint32)
    self.data = raw["data"].astype(np.float32)
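
For anyone who wants to apply the same conversion to a scipy matrix before handing it to other code, here is a minimal sketch. The function name cast_csr_arrays is hypothetical, and the overflow guard is an addition for illustration that is not part of the lines quoted above; it only checks that the number of stored entries and the largest column index fit in uint32.

```python
import numpy as np
import scipy.sparse

UINT32_MAX = int(np.iinfo(np.uint32).max)  # 4294967295


def cast_csr_arrays(M):
    """Hypothetical helper: return (indptr, indices, data) as uint32/uint32/float32."""
    M = M.tocsr()
    n_entries = int(M.indptr[-1])  # last pointer equals the number of stored entries
    if n_entries > UINT32_MAX or M.shape[1] - 1 > UINT32_MAX:
        raise OverflowError("CSR arrays do not fit in uint32")
    return (
        M.indptr.astype(np.uint32),
        M.indices.astype(np.uint32),
        M.data.astype(np.float32),
    )


# Illustrative boolean matrix, similar in kind to the one from edg2npz.py.
rows = np.array([0, 1, 1, 2])
cols = np.array([1, 0, 2, 1])
M = scipy.sparse.csr_matrix((np.ones(4, dtype=bool), (rows, cols)), shape=(3, 3))
indptr, indices, data = cast_csr_arrays(M)
```

A graph with about 2B stored entries, as mentioned in the original post, is below the uint32 limit of 4,294,967,295, so the guard would pass for it.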

kwchurch · 2022-06-29T14:52:23Z

great

kwchurch · 2022-06-29T15:16:21Z

let me know when you have something ready to try out

RemyLau · 2022-06-29T15:37:02Z

@kwchurch it is ready to be tried out, but it is not on the main branch. You'll need to check out the scipy-csr branch, where you will find the new changes.

RemyLau · 2022-07-01T13:53:26Z

Hi @kwchurch, I have completed some more testing and merged the new feature (implicit IDs) back to the main branch (see 2d58132). Let me know if you get a chance to test and see if this works in your case.

kwchurch · 2022-10-11T09:26:07Z

I have some graphs with nodes that have no edges. Is that a problem?

init pecanpy: p = 1, q = 1, workers = 16, verbose = True, extend = True, gamma = 0, random_state = None
/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/rw/sparse_rw.py:30: RuntimeWarning: Mean of empty slice.
  data[indptr[i] : indptr[i + 1]].mean()
/home/k.church/venv/gft/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/home/k.church/venv/gft/lib/python3.8/site-packages/numpy/core/_methods.py:262: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/home/k.church/venv/gft/lib/python3.8/site-packages/numpy/core/_methods.py:222: RuntimeWarning: invalid value encountered in true_divide
  arrmean = um.true_divide(arrmean, div, out=arrmean, casting='unsafe',
/home/k.church/venv/gft/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/var/spool/slurm/d/job27656002/slurm_script", line 8, in <module>
    sys.exit(main())
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 333, in main
    walks = simulate_walks(args, g)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/wrappers.py", line 18, in wrapper
    result = func(*args, **kwargs)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/cli.py", line 320, in simulate_walks
    return g.simulate_walks(args.num_walks, args.walk_length)
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/pecanpy/pecanpy.py", line 153, in simulate_walks
    walk_idx_mat = self._random_walks(
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 468, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function imul>) found for signature:

 >>> imul(array(bool, 1d, C), array(float64, 1d, C))

There are 8 candidate implementations:
  - Of which 4 did not match due to:
  Overload of function 'imul': File: <numerous>: Line N/A.
    With argument(s): '(array(bool, 1d, C), array(float64, 1d, C))':
   No match.
  - Of which 2 did not match due to:
  Overload in function 'NumpyRulesInplaceArrayOperator.generic': File: numba/core/typing/npydecl.py: Line 244.
    With argument(s): '(array(bool, 1d, C), array(float64, 1d, C))':
   Rejected as the implementation raised a specific error:
     AttributeError: 'NoneType' object has no attribute 'args'
  raised from /home/k.church/venv/gft/lib/python3.8/site-packages/numba/core/typing/npydecl.py:255
  - Of which 2 did not match due to:
  Operator Overload in function 'imul': File: unknown: Line unknown.
    With argument(s): '(array(bool, 1d, C), array(float64, 1d, C))':
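
The "Mean of empty slice" warning in the log points at data[indptr[i] : indptr[i + 1]].mean(), which is an empty slice whenever row i has no stored entries. A minimal sketch for locating such rows in a CSR matrix is shown below; the four-node example graph is an assumption made for illustration. Whether isolated nodes are a problem for the walks themselves is not settled in this thread, and the TypingError in the log again involves a bool data array.

```python
import numpy as np
import scipy.sparse

# Illustrative graph with 4 nodes; node 3 has no edges.
rows = np.array([0, 1, 1, 2])
cols = np.array([1, 0, 2, 1])
vals = np.ones(len(rows), dtype=np.float32)
M = scipy.sparse.csr_matrix((vals, (rows, cols)), shape=(4, 4))

# Consecutive indptr differences give the number of stored entries per row.
out_degree = np.diff(M.indptr)

# Rows with zero stored entries are nodes without outgoing edges.
isolated = np.flatnonzero(out_degree == 0)
print(isolated)
```

Because this only reads indptr, it stays cheap even for very large matrices.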

RemyLau added the enhancement New feature or request label Jun 29, 2022

RemyLau linked a pull request Jun 29, 2022 that will close this issue

Support scipy CSR (implicit node IDs) #124

Merged

RemyLau closed this as completed Jun 29, 2022

RemyLau reopened this Jul 1, 2022

RemyLau closed this as completed in #124 Jul 1, 2022

RemyLau reopened this Jul 1, 2022
