• Raunak Shah | R2D2: Reducing Redundancy and Duplication in Data Lakes | #59

  • 2024/10/28
  • Duration: 31 min
  • Podcast


  • Summary

  • In this episode, Raunak Shah joins us to discuss the critical issue of data redundancy in enterprise data lakes, which can lead to soaring storage and maintenance costs. Raunak highlights how large-scale data environments, ranging from terabytes to petabytes, often contain duplicate and redundant datasets that are difficult to manage. He introduces the concept of "dataset containment" and explains its significance in identifying and reducing redundancy at the table level in these massive data lakes—an area where there has been little prior work.
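
    The episode notes do not define containment formally, but informally a dataset B is contained in a dataset A if all of B's columns and rows also appear in A. As a minimal sketch of that idea (the function and the example tables below are illustrative, not taken from the paper), a pandas version of the check might look like this:

    ```python
    import pandas as pd

    def is_contained(b: pd.DataFrame, a: pd.DataFrame) -> bool:
        """Rough check: does every row of b appear in a, projected onto b's columns?"""
        # Schema-level check: all of b's columns must exist in a.
        if not set(b.columns).issubset(a.columns):
            return False
        # Content-level check: left-merge b against the projection of a;
        # the indicator column marks which rows of b found a match in a.
        projected = a[list(b.columns)].drop_duplicates()
        merged = b.merge(projected, how="left", indicator=True)
        return bool((merged["_merge"] == "both").all())

    a = pd.DataFrame({"id": [1, 2, 3], "city": ["NY", "LA", "SF"], "pop": [8.3, 4.0, 0.9]})
    b = pd.DataFrame({"id": [1, 3], "city": ["NY", "SF"]})
    print(is_contained(b, a))  # True: b is a projection and selection of a
    ```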


    Raunak then dives into the details of R2D2, a novel three-step hierarchical pipeline designed to efficiently tackle dataset containment. By utilizing schema containment graphs, statistical min-max pruning, and content-level pruning, R2D2 progressively reduces the search space to pinpoint redundant data. Raunak also discusses how the system, implemented on platforms like Azure Databricks and AWS, offers significant improvements over existing methods, processing TB-scale data lakes in just a few hours with high accuracy. He concludes with a discussion on how R2D2 optimally balances storage savings and performance by identifying datasets that can be deleted and reconstructed on demand, providing valuable insights for enterprises aiming to streamline their data management strategies.
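
    R2D2 itself runs on Spark-scale infrastructure; the sketch below only illustrates the three-step hierarchy described above, and all names and data structures are assumptions of mine rather than the paper's API. The idea is that each stage is cheap relative to the next, so the candidate set shrinks before the expensive comparison ever runs:

    ```python
    from dataclasses import dataclass, field
    from itertools import permutations

    @dataclass
    class TableMeta:
        name: str
        columns: frozenset[str]
        # Per-column (min, max) statistics, collected cheaply from metadata.
        min_max: dict[str, tuple[float, float]] = field(default_factory=dict)

    def schema_prune(tables: list[TableMeta]):
        # Step 1: schema containment graph - keep only ordered pairs (b, a)
        # where b's columns are a subset of a's; otherwise containment is impossible.
        return [(b, a) for b, a in permutations(tables, 2) if b.columns <= a.columns]

    def min_max_prune(pairs):
        # Step 2: statistical pruning - b cannot be contained in a if any of
        # b's column ranges falls outside a's [min, max] for that column.
        def ranges_fit(b, a):
            return all(
                a.min_max[c][0] <= b.min_max[c][0] and b.min_max[c][1] <= a.min_max[c][1]
                for c in b.columns
            )
        return [(b, a) for b, a in pairs if ranges_fit(b, a)]

    # Step 3 (content-level pruning) would compare actual rows of the few
    # surviving pairs - the costliest check runs last, on the smallest set.
    ```

    This ordering is the point of the hierarchy: cheap metadata-only filters discard most pairs, so the content-level comparison stays tractable even at TB scale.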


    Materials:

    • SIGMOD'24 paper - R2D2: Reducing Redundancy and Duplication in Data Lakes
    • ICDE'24 paper - Towards Optimizing Storage Costs in the Cloud



    Hosted on Acast. See acast.com/privacy for more information.


