ETL vs ELT

Extract, Transform and Load (where the transform phase and load phase can be swapped) refers to processes for organising a data pipeline into conceptual or actual phases or stages.

WIP: Incomplete.

We worked on this for the Thinkathon in 2023 Aginic Bootcamp and while it was an entertaining exercise I felt unsatisfied with the brief overview we were able to cover. I personally would like a more thorough analysis of the differences, reasons and best practices regarding the topic.

What do you think? Leave a comment below if you feel this would be worthwhile.

Extract Transform Load vs Extract Load Transform

Extract Transform Load (ETL)

  • Has been used by data engineers since the 1970s. 1
  • Primarily used for relational data warehouses where the sources may be varied but the destination only supports structured data.

Pros

  • Cheaper storage
  • The extract and transform stages are typically combined, allowing for faster access to insights because the data is immediately available once it is loaded

Cons / Challenges

  • Potential loss of data, when transformation fails on highly volatile data
  • If the source data is changing, it can be impossible to re-run the process on the original data when a problem is discovered changes
  • Multiple downstream sources might need same source data, increasing pressure on source system

Extract Load Transform (ELT)

It’s probably more like ELTLTLTL

Pros

  • Can be more robust as replicating data is less complex than transforming and so if the pipeline has problems in the transformation stage, it can be rerun from the raw data without disturbing
  • Can re-run pipelines on the same data if anything goes wrong
  • Process is more staged

Cons / Challenges

  • Increased storage cost, keeping raw data
  • Governance (GDPR / HIPAA / CCPA) can be more difficult to do right:
    • TODO: Re-find the article that made this argument 2 3 4 5 6

References



Last modified January 10, 2023: whoops (6fdf6e0)