Data reading and processing optimizations in pySpark #29
-
There are several parameters through which Spark operations can be optimised, such as changing the number of parallel tasks, using a different number of shuffle partitions, tuning the garbage collector, caching, etc. Which parameters to change depends on the configuration of the user's machine, such as the total memory, the number of distributed machines in use, and the number of CPU cores available. Internally, Spark automatically detects many of these settings and chooses sensible defaults. For a user/researcher who would typically run the pipeline on a single-machine setup, I think the automatic optimisations done by Spark should be sufficient for our use case (a minimal configuration sketch is included below for reference). Further Reference:
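As a reference only, here is a minimal sketch of how the parameters mentioned above could be set explicitly if Spark's defaults ever prove insufficient on a single-machine setup. The application name, input path, and the concrete values are illustrative assumptions, not recommendations from the pipeline itself.

```python
from pyspark.sql import SparkSession

# Hypothetical single-machine session; all values below are illustrative.
# Note: driver memory must be set before the JVM/session is created.
spark = (
    SparkSession.builder
    .appName("pipeline")                           # assumed app name
    .master("local[*]")                            # use all local CPU cores
    .config("spark.driver.memory", "8g")           # assumed memory budget
    .config("spark.sql.shuffle.partitions", "8")   # fewer than the default 200 for small local jobs
    .config("spark.default.parallelism", "8")      # parallel tasks for RDD operations
    .getOrCreate()
)

# Caching keeps a frequently reused DataFrame in memory across actions.
df = spark.read.parquet("data/input.parquet")      # hypothetical input path
df = df.cache()
df.count()                                         # action that materialises the cache
```

In practice, the shuffle-partition and parallelism values would be matched to the number of local cores rather than hard-coded as above.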
-
Research pySpark bottlenecks and how reading and processing data can be optimized.
Related: #28