Overwrite Table Partitions Using PySpark
Scenario
Source
Target
Example Datasets
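The page's original sample data is not reproduced here; as a stand-in, the sketch below builds two small DataFrames with an assumed schema of order_id, order_status, and an order_date partition column. Every name and value in it is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-overwrite-demo").getOrCreate()

# Hypothetical target: rows already written to the partitioned table.
target_df = spark.createDataFrame(
    [
        (1, "CREATED", "2023-01-01"),
        (2, "CREATED", "2023-01-01"),
    ],
    ["order_id", "order_status", "order_date"],
)

# Hypothetical source: a newer batch that updates order 2 and adds order 3.
source_df = spark.createDataFrame(
    [
        (2, "SHIPPED", "2023-01-02"),
        (3, "CREATED", "2023-01-02"),
    ],
    ["order_id", "order_status", "order_date"],
)
```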
Solution Walkthrough

Merge the source and target DataFrames by column name
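A minimal sketch of this step, using the source_df and target_df assumed above. PySpark's unionByName stacks two DataFrames by matching columns on name rather than position:

```python
# Combine new and existing rows; columns are matched by name, not position.
merged_df = source_df.unionByName(target_df)
```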
De-duplicate the merged DataFrame using row_number() over a window partitioned by order_id, keeping only the latest record for each order_id
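A sketch of the de-duplication, assuming order_date is the column that determines recency (the original article may order by a different column, such as an update timestamp):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows newest-first within each order_id, then keep only the top row.
dedup_window = Window.partitionBy("order_id").orderBy(F.col("order_date").desc())

curated_df = (
    merged_df
    .withColumn("rn", F.row_number().over(dedup_window))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
```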
Overwrite the partitions with the new curated dataset at the output location
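A sketch of the write, with a hypothetical output path and order_date as the assumed partition column. Setting spark.sql.sources.partitionOverwriteMode to dynamic makes Spark replace only the partitions present in curated_df instead of wiping the whole output path:

```python
# Replace only the partitions that appear in curated_df.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(
    curated_df.write
    .mode("overwrite")
    .partitionBy("order_date")   # assumed partition column
    .parquet("/path/to/output")  # hypothetical output location
)
```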
Putting it all together in one script
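Assembled from the steps above, an end-to-end sketch might look like the following. The input and output paths, the schema, and the order_date ordering/partition column are all assumptions carried over from the earlier snippets:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("overwrite-table-partitions").getOrCreate()

# Replace only the partitions being rewritten, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Hypothetical input locations.
source_df = spark.read.parquet("/path/to/source")
target_df = spark.read.parquet("/path/to/target")

# 1. Merge the source and target DataFrames by column name.
merged_df = source_df.unionByName(target_df)

# 2. Keep only the latest record for each order_id.
dedup_window = Window.partitionBy("order_id").orderBy(F.col("order_date").desc())
curated_df = (
    merged_df
    .withColumn("rn", F.row_number().over(dedup_window))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# 3. Overwrite only the affected partitions at the output location.
(
    curated_df.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("/path/to/output")
)
```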
Considerations
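One consideration worth flagging: with Spark's default static partition overwrite mode, mode("overwrite") deletes every existing partition under the output path, not just the ones being rewritten. Setting spark.sql.sources.partitionOverwriteMode to dynamic, as in the snippets above, limits the overwrite to partitions present in the curated DataFrame; this setting is available in Spark 2.3 and later.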