Help with understanding DataProfiler options and performance #1098
Comments
I think I made some more progress in debugging the slowness I'm seeing. When I look at the output, some columns are being marked as categorical but they have an extremely large number of unique records. For example, loading in a ~350MB parquet file and using profiling through SnakeViz, I'm seeing a lot of time spent on It took about 800 seconds to run the profile and write the JSON results to a file. The example dataset has 40 columns.
However, if I take out the 5 columns marked as categorical that have a particularly large number of unique items and re-run my simple test with just the remaining 35 columns, the run takes about 400 seconds, cutting my time in half. Is it possible via the options to override this to prevent it from hanging? I can turn off the
Thank you
And it looks like you have it already via
There doesn't appear to be a default set for these two values according to the code. Would it be prudent to put a default in here?
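To make the idea of a default cutoff concrete, here is a minimal sketch of the kind of stop condition being discussed: look at a bounded sample of a column and give up on treating it as categorical once the share of unique values is too high. This is not DataProfiler's implementation. The function name, parameter names, and default values are all hypothetical and chosen only for illustration.

```python
from typing import Hashable, Iterable


def looks_categorical(
    values: Iterable[Hashable],
    max_sample_size: int = 10_000,   # hypothetical default, not taken from the library
    max_unique_ratio: float = 0.5,   # hypothetical default, not taken from the library
) -> bool:
    """Decide from a bounded sample whether a column is plausibly categorical."""
    seen = set()
    count = 0
    for value in values:
        seen.add(value)
        count += 1
        # Stop reading once the sample is large enough, so a huge
        # high-cardinality column does not cost a full pass.
        if count >= max_sample_size:
            break
    if count == 0:
        return False
    # A column whose sampled values are mostly distinct is not treated as categorical.
    return len(seen) / count <= max_unique_ratio


# Two repeated labels: low unique ratio, so plausibly categorical.
low_cardinality = looks_categorical(["a", "b"] * 500)
# Every value distinct: ratio of 1.0 exceeds the cutoff.
high_cardinality = looks_categorical(range(1_000_000), max_sample_size=100)
```

With defaults like these in place, a column with hundreds of thousands of distinct values would be ruled out after the first sample instead of being counted in full.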
Hey @carlsonp -- first thanks for the detailed notes and documentation here.
This is a bit of a question and a bit of feature request I think?
I'm trying to understand why profiling on some tables is slow. When it calculates statistics, it seems to take a long time, even for relatively small sample sizes (less than 1,000,000 rows). I started looking into the setup of the profiling and saw the Profile Options.
I started to go through and turn off calculations I don't need. For example:
Is there a way to print out ALL the profile options including the defaults? This would help me debug and understand what is being calculated. From a feature standpoint, perhaps more of the objects should expose friendly printing via `__str__` methods?
Thanks
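As a stopgap until something like that exists, a generic helper can walk nested option objects and list every attribute with its current value. The sketch below uses only the standard library and assumes the options are plain Python objects whose settings live in instance attributes. `CategoricalOpts` and `ColumnOpts` are hypothetical stand-ins for illustration, not the library's classes.

```python
from typing import Any, Dict


def flatten_options(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Collect public attributes of nested option objects as dotted keys."""
    flat: Dict[str, Any] = {}
    for name, value in sorted(vars(obj).items()):
        if name.startswith("_"):
            continue  # skip private attributes
        key = f"{prefix}{name}"
        if hasattr(value, "__dict__"):
            # Nested option object: recurse and extend the dotted path.
            flat.update(flatten_options(value, prefix=key + "."))
        else:
            flat[key] = value
    return flat


class CategoricalOpts:  # hypothetical stand-in, not a library class
    def __init__(self) -> None:
        self.is_enabled = True
        self.top_k_categories = None


class ColumnOpts:  # hypothetical stand-in, not a library class
    def __init__(self) -> None:
        self.category = CategoricalOpts()
        self.null_values = None


opts = ColumnOpts()
opts.category.is_enabled = False  # override one default

# Print every option, overridden or not, one per line.
for key, value in flatten_options(opts).items():
    print(f"{key} = {value!r}")
```

The helper does not guard against objects that reference each other in a cycle, so it is only suitable for tree-shaped option structures.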