cursor.fetchallarrow() followed by SegmentationFault #139

Open
IceS2 opened this issue Dec 15, 2017 · 24 comments

@IceS2

IceS2 commented Dec 15, 2017

Hello guys, this is the first time I've posted an issue on a project, so I'm sorry if I'm doing it the wrong way. Please correct me if so (=

I'm trying to use turbodbc with pyarrow and I'm running into a segmentation fault.
I'm querying a SQL Server database through FreeTDS. Assigning the result of cursor.fetchallarrow() to a variable usually triggers a segmentation fault right away. When it does not, the segmentation fault occurs as soon as I try to do anything with that variable.
My Python version and installed packages:

Python 3.6.3
ansible==2.4.0.0
asn1crypto==0.23.0
attrs==17.3.0
avro-python3==1.8.2
awscli==1.11.143
bcrypt==3.1.3
beautifulsoup4==4.6.0
boto==2.48.0
boto3==1.4.7
botocore==1.7.1
bs4==0.0.1
cached-property==1.3.0
certifi==2017.7.27.1
cffi==1.11.2
chardet==3.0.4
colorama==0.3.7
colorclass==2.2.0
configparser==3.5.0
cryptography==2.0.3
Cython==0.27.3
decorator==4.1.2
docker==2.5.1
docker-compose==1.15.0
docker-pycreds==0.2.1
dockerpty==0.4.1
docopt==0.6.2
docutils==0.14
formats==0.1.1
google-api-python-client==1.6.4
gspread==0.6.2
httplib2==0.10.3
idna==2.6
ipython==6.1.0
ipython-genutils==0.2.0
jedi==0.10.2
Jinja2==2.9.6
jmespath==0.9.3
jsonschema==2.6.0
MarkupSafe==1.0
mock==2.0.0
numpy==1.13.1
oauth2client==4.1.2
pandas==0.20.3
paramiko==2.3.1
pbr==3.1.1
pexpect==4.2.1
pickleshare==0.7.4
pluggy==0.6.0
prompt-toolkit==1.0.15
ptyprocess==0.5.2
py==1.5.2
pyarrow==0.7.1
pyasn1==0.3.7
pyasn1-modules==0.1.5
pybind11==2.2.1
pycairo==1.15.4
pycparser==2.18
pycrypto==2.6.1
Pygments==2.2.0
pymssql==2.1.3
PyMySQL==0.7.11
PyNaCl==1.1.2
pyOpenSSL==17.3.0
pytest==3.3.0
python-dateutil==2.6.1
pytz==2017.2
pywal==0.7.1
PyYAML==3.12
requests==2.18.4
rsa==3.4.2
s3transfer==0.1.10
simplegeneric==0.8.1
six==1.11.0
slacker==0.9.60
SQLAlchemy==1.1.13
texttable==0.8.8
tortilla==0.4.2
traitlets==4.3.2
turbodbc==2.4.1
ua-parser==0.7.3
Unidecode==0.4.21
uritemplate==3.0.0
urllib3==1.22
user-agents==1.1.0
wcwidth==0.1.7
websocket-client==0.44.0
xlrd==1.1.0

You can use the following code to try to reproduce the issue. I only removed the database credentials.

from turbodbc import connect, make_options

options = make_options(prefer_unicode=True)
connection = connect(driver='FreeTDS', server='<server>', port='<port>', database='<database>', uid='<uid>', pwd='<pwd>', turbodbc_options=options)

cursor = connection.cursor()
cursor.execute('select * from <table>')

table = cursor.fetchallarrow()
@xhochy
Collaborator

xhochy commented Dec 15, 2017

Can you provide us with a backtrace related to the segfault?

On Linux you can get it with:

ulimit -c unlimited
<run python code>
gdb python core

At the resulting gdb prompt, enter bt full and paste the output here (please make sure that it does not contain credentials).
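If no core file can be obtained, a Python-level traceback may still help narrow things down. The sketch below uses the standard library's faulthandler module (Python 3.3 and later, so it applies to the Python 3.6 setup above but not to Python 2). The helper name enable_crash_traceback is made up for illustration. It only shows which Python line was executing at the time of the crash, not the C++ frames that gdb would show.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Ask CPython to dump Python tracebacks on fatal signals such as SIGSEGV.

    `stream` must be a real file with a file descriptor and must stay open
    for as long as the handler is enabled. Defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Typical use: call this once at the top of the reproducing script, before
# connecting and calling cursor.fetchallarrow():
#
#     enable_crash_traceback()
```

The same effect is available without code changes by starting the interpreter with python -X faulthandler test_turbodbc_pyarrow.py.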

@IceS2
Author

IceS2 commented Dec 15, 2017

It seems I can't Oo... Any idea why?

$ ulimit -c unlimited
$ python test_turbodbc_pyarrow.py
[1]    26933 segmentation fault (core dumped)  python test_turbodbc_pyarrow.py
$ gdb python core
GNU gdb (GDB) 8.0.1
Copyright © 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
/home/pablo/workspace/scratch/core: No such file or directory.
(gdb) bt full
No stack.
(gdb) 

@MathMagique
Member

Hello @IceS2! Thanks for reporting! You did well :-).

I have a hunch that the prefer_unicode=True in combination with fetchallarrow() is the culprit here, as I fear that this code path is not properly implemented yet. Even though prefer_unicode=True is the recommended setting for MSSQL, please check whether the segmentation fault disappears if this option is set to False.

As a workaround, you could use fetchallnumpy() instead of fetchallarrow(). Performance is comparable, and fetchallnumpy() has full support for prefer_unicode=True.

@xhochy
Collaborator

xhochy commented Dec 16, 2017

@IceS2 it could also be that your core file is named core.26933 (the number is taken from the message 26933 segmentation fault (core dumped)). Whether the numbered suffix is used depends on your distribution.
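On Linux, the kernel setting in /proc/sys/kernel/core_pattern decides where core files end up, so it is worth checking before searching for the file. The following is a minimal sketch, assuming a Linux system; the function names are made up for illustration, and the descriptions it returns are only a rough reading of the pattern.

```python
from pathlib import Path

# Linux-only location of the kernel's core dump naming pattern.
CORE_PATTERN_FILE = Path("/proc/sys/kernel/core_pattern")


def describe_core_pattern(pattern):
    """Explain roughly where core dumps go for a given core_pattern value."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # A leading pipe means the dump is handed to a helper program
        # instead of being written next to the crashing script.
        parts = pattern[1:].split()
        helper = parts[0] if parts else ""
        return "piped to helper program: " + helper
    if "/" in pattern:
        return "written using path template: " + pattern
    return "written to the crashing process's working directory as: " + pattern


def read_core_pattern(path=CORE_PATTERN_FILE):
    """Return the current core_pattern, or None if it cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    current = read_core_pattern()
    if current is None:
        print("core_pattern is not available on this system")
    else:
        print(describe_core_pattern(current))
```

If the pattern starts with a pipe, no core file appears in the working directory, which would match the "No such file or directory" message from gdb above. On systemd-based distributions the helper is commonly systemd-coredump, in which case coredumpctl gdb opens the most recent dump.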

@IceS2
Author

IceS2 commented Dec 18, 2017

@MathMagique, @xhochy, sorry for the delayed answer. I wasn't near my computer over the weekend!
So, I've run the code again with prefer_unicode=False and the result was the same: [1] 23037 segmentation fault (core dumped) without any backtrace.

It seems to work with cursor.fetchallnumpy(). I was testing turbodbc because I'm experimenting with pyarrow and I need to do some batch extractions from a database. Getting turbodbc results straight into an Arrow table would be awesome!
My fallback plan is to work with SQLAlchemy and pandas. I'm not sure how to transform the OrderedDict from cursor.fetchallnumpy() into a pyarrow table.

@dirkjonker
Contributor

What version of FreeTDS and unixODBC are you using? Can you test using the Microsoft ODBC driver for Linux instead of FreeTDS? See: https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server

@IceS2
Author

IceS2 commented Dec 18, 2017

Hey @dirkjonker, I've just tested using the Microsoft ODBC driver you mentioned. The result was the same: [1] 3542 segmentation fault (core dumped)

The versions of the packages you asked about are:

extra/unixodbc 2.3.4-2
extra/freetds 1.00.44-1
local/msodbcsql 13.1.9.1-1

@dirkjonker
Contributor

That's too bad, sometimes switching the driver works to resolve this type of problem.

What types of columns are in the table you are selecting from?

@xhochy
Collaborator

xhochy commented Dec 19, 2017

@IceS2 are you on Fedora 24+? There we have a known problem with pyarrow in combination with turbodbc.

@xhochy
Collaborator

xhochy commented Dec 19, 2017

It can be fixed by also building pyarrow from source, which is not entirely simple: https://arrow.apache.org/docs/python/development.html#developing-on-linux-and-macos. We could also continue to work on providing manylinux1 wheels for turbodbc: #108

Alternatively, using a conda-based installation instead of a pip-based one will work.

@IceS2
Author

IceS2 commented Dec 19, 2017

@xhochy, I'm actually running Arch Linux!
Do you think it'd be fixed as well by building pyarrow from source? I could try that as soon as I get some "me time"

@xhochy
Collaborator

xhochy commented Dec 19, 2017

@IceS2 It could be a possible fix. I guess the Fedora problem is due to Turbodbc being compiled with a different C++ ABI than the pyarrow wheel. Rebuilding both with the same ABI should fix the problems.

@IceS2
Author

IceS2 commented Jan 5, 2018

Hey @xhochy, Sorry for the late answer. I had to work on other stuff first.
I'm back at turbodbc, but after I upgraded pyarrow to 0.8.0, I was getting an error with turbodbc saying I didn't have the pyarrow support installed. So I uninstalled turbodbc and tried to install it back with pip, but I'm getting error: command 'gcc' failed with exit status 1
Can you help me out? Thanks!

@MathMagique
Member

@IceS2 Hi again! Have you tried using more recent versions of turbodbc/pyarrow in the meantime? Does this fix things?

@albertoRamon

albertoRamon commented Sep 18, 2018

Same error, on the same line (the last one):

from turbodbc import connect
import pyarrow
connection = connect(dsn='mysql_DNS_ANSI')
cursor = connection.cursor()
cursor.execute('SELECT col1 from test01;')
table = cursor.fetchallarrow()

Changing the last line to print cursor.fetchall() returns:

[[1L], [2L], [3L], [4L], [5L]]

Can be reproduced with this command:

docker run -it albertozgz/turbodbc_extrator:debian9 bash

(You only need to connect this Docker container to your database. I use MySQL 8.0.)

TIP1: table=cursor.fetchallnumpy() works fine
TIP2: tested both the ANSI and the Unicode driver
TIP3: tested fetchallarrow(adaptive_integers=True/False)
TIP4: fetching in batches crashes as well:

batches = cursor.fetcharrowbatches()
for batch in batches:
  print(batch)

segmentation fault (core dumped)

@MathMagique
Member

@xhochy Would you have the time to look at @albertoRamon's reproducing example, please?

@xhochy
Collaborator

xhochy commented Sep 19, 2018

This is the same problem as above. Debian 9 builds by default with a different C++ ABI than the one the pyarrow wheels are built with. As long as we don't ship turbodbc manylinux1 wheels, these segfaults will persist.

@MathMagique
Member

Would it work to switch to the conda environment with our "blessed" builds?

@xhochy
Collaborator

xhochy commented Sep 19, 2018

Yes, using pyarrow and turbodbc both from conda-forge will work. They are both built in the same consistent environment.

@MathMagique
Member

@albertoRamon Could you try using the turbodbc conda package, please? https://anaconda.org/conda-forge/turbodbc

@albertoRamon

Yes, of course.

I can run any test you would like me to try.
Or is the solution simply not to use Debian 9? (I tried Alpine 3.8 and Debian 10, and neither worked.)

@MathMagique
Member

Anything too modern will not work because the precompiled pyarrow wheel uses a "classic" version of the ABIs, while pip install turbodbc will compile stuff with the latest and greatest ABIs. Conda packages for turbodbc and pyarrow are built with consistent settings, and should work on any modern system.
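A rough way to check which side of this ABI split a compiled extension is on is sketched below. It relies on the fact that libstdc++'s newer ABI places types such as std::string in the inline namespace std::__cxx11, so binaries built against it usually contain that marker in their symbol names. The function name uses_cxx11_abi_symbols is made up for illustration, and the check is only a heuristic: a library that never exposes such types would not show the marker even if it was built with the new ABI.

```python
def uses_cxx11_abi_symbols(path, chunk_size=1 << 20):
    """Return True if the binary at `path` contains the bytes b"__cxx11".

    The file is read in chunks; a short tail of the previous chunk is kept
    so that a marker spanning two chunks is still found.
    """
    marker = b"__cxx11"
    keep = len(marker) - 1
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if marker in window:
                return True
            tail = window[-keep:]
```

Running this on the shared library files (.so) of both the turbodbc and pyarrow installations and getting different answers would be consistent with the mismatch described above. Tools such as nm or readelf give a more precise picture.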

@albertoRamon

@MathMagique @xhochy , Thanks
Your suggestion works fine

pip uninstall pyarrow
pip uninstall turbodbc

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
chmod +x Miniconda2-latest-Linux-x86_64.sh
./Miniconda2-latest-Linux-x86_64.sh
source ~/.bashrc

conda install -c conda-forge pyarrow
conda install -c conda-forge turbodbc

python:

table = cursor.fetchallarrow()
print table.num_rows

bash:> 5

If you think the best option for a production environment is to download the code from Git and compile it, I will be happy to modify the Dockerfile to carry out these steps.

BR

@MathMagique
Member

I would never download code from Git for production; if anything, download source packages from pypi.org. I'd suggest going down the conda route for production, however, as this has already solved the hassle of compiling stuff the right way.
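For anyone following the conda route, a quick sanity check that both packages really come from conda (and not from a leftover pip install) could look like the sketch below. It assumes the usual conda layout in which each installed package has a JSON record with "name" and "version" fields in the environment's conda-meta directory. The function name conda_package_versions is made up for illustration.

```python
import json
import sys
from pathlib import Path


def conda_package_versions(prefix, wanted):
    """Map each wanted package name to its conda-recorded version, or None."""
    found = {name: None for name in wanted}
    meta_dir = Path(prefix) / "conda-meta"
    if not meta_dir.is_dir():
        return found  # not a conda environment
    for record in meta_dir.glob("*.json"):
        try:
            info = json.loads(record.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed record, skip it
        if isinstance(info, dict) and info.get("name") in found:
            found[info["name"]] = info.get("version")
    return found


if __name__ == "__main__":
    # sys.prefix is the root of the environment the interpreter runs in.
    print(conda_package_versions(sys.prefix, ["pyarrow", "turbodbc"]))
```

A value of None for either package suggests it is not managed by conda in that environment, in which case the pip-installed build may still be the one being imported.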
