Data reading and processing optimizations in pySpark #29
-
There are several parameters through which Spark operations can be optimised, such as changing the number of parallel tasks, using a different number of shuffle partitions, tuning the garbage collector, caching, etc. Which parameters to change depends on the configuration of the user's machine, such as the total memory, the number of distributed machines in use, and the number of CPU cores available. Internally, Spark automatically detects many of these settings and chooses sensible defaults. For a user/researcher who would typically run the pipeline on a single-machine setup, I think the automatic optimisations done by Spark should be sufficient for our use case (a minimal configuration sketch is included below for reference). Further Reference:
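As a reference only, here is a minimal sketch of how the parameters mentioned above could be set explicitly if Spark's defaults ever prove insufficient on a single-machine setup. The application name, input path, and the concrete values are illustrative assumptions, not recommendations from the pipeline itself.

```python
from pyspark.sql import SparkSession

# Hypothetical single-machine session; all values below are illustrative.
# Note: driver memory must be set before the JVM/session is created.
spark = (
    SparkSession.builder
    .appName("pipeline")                           # assumed app name
    .master("local[*]")                            # use all local CPU cores
    .config("spark.driver.memory", "8g")           # assumed memory budget
    .config("spark.sql.shuffle.partitions", "8")   # fewer than the default 200 for small local jobs
    .config("spark.default.parallelism", "8")      # parallel tasks for RDD operations
    .getOrCreate()
)

# Caching keeps a frequently reused DataFrame in memory across actions.
df = spark.read.parquet("data/input.parquet")      # hypothetical input path
df = df.cache()
df.count()                                         # action that materialises the cache
```

In practice, the shuffle-partition and parallelism values would be matched to the number of local cores rather than hard-coded as above.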
-
Research pySpark bottlenecks and how reading and processing data can be optimized.
Related: #28