Data Access Domain
The Data Access Domain is responsible for storing and managing data, whether raw or normalized. It serves as the central repository for data collected from various sources and provides the foundation for downstream data processing and analysis.
Key Concepts
- Scalability: Ability to handle large volumes of data.
- Flexibility: Support for diverse data types and formats.
- Data Governance: Policies and controls for managing data.
Responsibilities
- Ingesting data from various sources.
- Storing data in a scalable and cost-effective manner.
- Maintaining data integrity and availability.
- Providing access to raw data for downstream processing.
Requirements for Future Consideration
User Persona: Data Producer
- As a Data Producer, I want to write a new dataset to an Iceberg table so that it becomes available for downstream consumption with its associated metadata, including schema and provenance.
- As a Data Producer, I want to append new data to an existing Iceberg table so that the table reflects the latest information without disrupting ongoing queries.
- As a Data Producer, I want to overwrite existing data in an Iceberg table with a new version so that I can correct errors or update the dataset.
- As a Data Producer, I want to update specific records in an Iceberg table so that I can modify individual data points efficiently.
- As a Data Producer, I want to delete specific records from an Iceberg table so that I can remove outdated or incorrect information.
- As a Data Producer, I want to evolve the schema of an Iceberg table (e.g., add a new column) so that I can accommodate new data requirements without causing compatibility issues for existing data or consumers.
- As a Data Producer, I want to partition my data when writing to an Iceberg table based on specific columns so that queries can efficiently filter relevant data.
- As a Data Producer, I want to control the data file format (e.g., Parquet, ORC) used when writing to an Iceberg table so that I can optimize for performance and storage efficiency.
- As a Data Producer, I want to specify compression settings for the data files written to Iceberg so that I can balance storage costs and I/O performance.
- As a Data Producer, I want to track the lineage or provenance of the data I write to Iceberg so that consumers can understand its origin and transformations.
- As a Data Producer, I want to manage access control policies for the Iceberg tables I create so that only authorized users can write to them.
- As a Data Producer, I want to monitor the data writing process to Iceberg so that I can identify and resolve any issues.
- As a Data Producer, I want to roll back to a previous version of the table if a data writing operation introduces errors so that consumers are not exposed to bad data.
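The append, overwrite, and rollback stories above all rest on one idea: every write produces a new immutable snapshot, and the table simply points at whichever snapshot is current. The following is a minimal conceptual sketch of that idea in plain Python. `ToyTable`, `Snapshot`, and their methods are hypothetical names invented for illustration; this is not the Iceberg API, and it keeps full row copies in memory where a real table format would track data files.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table after one committed write."""
    snapshot_id: int
    operation: str          # "append" or "overwrite"
    rows: Tuple[Row, ...]   # full table contents as of this snapshot


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a snapshot-based table."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_id: Optional[int] = None

    def _commit(self, operation: str, rows: List[Row]) -> int:
        # Snapshots are only ever added, never modified or removed here.
        snap = Snapshot(len(self.snapshots) + 1, operation, tuple(rows))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id
        return snap.snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> List[Row]:
        """Read the current snapshot, or a specific historical one."""
        target = self.current_id if snapshot_id is None else snapshot_id
        if target is None:
            return []  # empty table, nothing committed yet
        for snap in self.snapshots:
            if snap.snapshot_id == target:
                return list(snap.rows)
        raise KeyError(f"unknown snapshot {target}")

    def append(self, rows: List[Row]) -> int:
        """Commit a snapshot containing the current rows plus new rows."""
        return self._commit("append", self.scan() + list(rows))

    def overwrite(self, rows: List[Row]) -> int:
        """Commit a snapshot that replaces the table contents."""
        return self._commit("overwrite", list(rows))

    def rollback_to(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot; history is kept."""
        self.scan(snapshot_id)  # raises KeyError if the snapshot is unknown
        self.current_id = snapshot_id


table = ToyTable()
first = table.append([{"id": 1, "value": "a"}])
second = table.append([{"id": 2, "value": "b"}])
bad = table.overwrite([{"id": 9, "value": "bad"}])
table.rollback_to(second)  # readers now see the two good rows again
```

Because rollback only moves the current pointer, the faulty snapshot remains readable by id for debugging until a retention policy removes it.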
User Persona: Data Consumer (Analyst, Data Scientist, Application)
- As a Data Consumer, I want to query the current version of an Iceberg table so that I can access the latest and most accurate data for analysis or application use.
- As a Data Consumer, I want to query a specific historical version (snapshot) of an Iceberg table (time travel) so that I can analyze data as it existed at a particular point in time for auditing or reproducibility.
- As a Data Consumer, I want to filter data in an Iceberg table based on specific criteria so that I can focus on the relevant subset of information.
- As a Data Consumer, I want to join data from multiple Iceberg tables so that I can perform more complex analysis and gain deeper insights.
- As a Data Consumer, I want to understand the schema of an Iceberg table so that I can correctly interpret the data.
- As a Data Consumer, I want to discover available Iceberg tables and their descriptions so that I can find the data I need.
- As a Data Consumer, I want to understand the partitioning scheme of an Iceberg table so that I can optimize my queries.
- As a Data Consumer, I want to understand the data quality or validation rules associated with an Iceberg table so that I can assess the reliability of the data.
- As a Data Consumer, I want to be notified when an Iceberg table I’m interested in is updated so that I can react to new information.
- As a Data Consumer, I want to access the lineage or provenance information of an Iceberg table so that I can understand the data’s journey.
- As a Data Consumer, I want to understand the access control policies on an Iceberg table so that I know what data I am authorized to access.
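The time travel story above can be read as a lookup over the table's snapshot history: given a timestamp, find the snapshot that was current at that moment. The sketch below illustrates that lookup in plain Python. The function name `snapshot_as_of` and the `(committed_at_ms, snapshot_id)` log layout are assumptions made for this example, not Iceberg's API, and the log is assumed to be sorted by commit time.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Each entry is (committed_at_ms, snapshot_id), sorted by commit time.
SnapshotLog = List[Tuple[int, int]]


def snapshot_as_of(log: SnapshotLog, timestamp_ms: int) -> Optional[int]:
    """Return the id of the latest snapshot committed at or before
    timestamp_ms, or None if the table had no snapshots yet."""
    commit_times = [committed_at for committed_at, _ in log]
    # bisect_right gives the count of snapshots committed <= timestamp_ms.
    idx = bisect_right(commit_times, timestamp_ms)
    return log[idx - 1][1] if idx > 0 else None


history = [(1000, 11), (2000, 22), (3000, 33)]
audit_snapshot = snapshot_as_of(history, 2500)  # resolves to snapshot 22
```

A consumer who needs reproducible results would typically record the resolved snapshot id rather than the timestamp, since the id identifies one fixed table state.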
User Persona: Data Administrator/Engineer
- As a Data Administrator/Engineer, I want to create new Iceberg tables and define their initial schema and partitioning.
- As a Data Administrator/Engineer, I want to manage the metadata store (catalog) for Iceberg tables, ensuring its availability and integrity.
- As a Data Administrator/Engineer, I want to configure and manage data retention policies for Iceberg tables so that storage costs are optimized and compliance requirements are met.
- As a Data Administrator/Engineer, I want to perform compaction operations on Iceberg tables to optimize query performance and reduce the number of small files.
- As a Data Administrator/Engineer, I want to manage access control policies for Iceberg tables, granting and revoking permissions to different users and roles.
- As a Data Administrator/Engineer, I want to monitor the health and performance of Iceberg tables and the associated infrastructure.
- As a Data Administrator/Engineer, I want to perform data migration or upgrades of Iceberg tables as needed.
- As a Data Administrator/Engineer, I want to define and enforce data quality rules for Iceberg tables.
- As a Data Administrator/Engineer, I want to configure and manage data skipping strategies for Iceberg tables to improve query efficiency.
- As a Data Administrator/Engineer, I want to integrate Iceberg with various compute engines and data processing frameworks.
- As a Data Administrator/Engineer, I want to implement backup and recovery strategies for Iceberg metadata and data.
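The compaction story above amounts to a planning problem: decide which small files should be rewritten together so that each output file approaches a target size. The following is a simplified sketch using first-fit-decreasing bin packing. The function `plan_compaction`, its parameters, and the file names are hypothetical; a real compaction job in an engine would have its own strategy and options, so treat this only as an illustration of the idea.

```python
from typing import Dict, List


def plan_compaction(file_sizes: Dict[str, int], target_size: int) -> List[List[str]]:
    """Group files smaller than target_size into bins whose combined size
    does not exceed target_size (first-fit decreasing)."""
    # Files already at or above the target are left alone.
    small = sorted(
        ((name, size) for name, size in file_sizes.items() if size < target_size),
        key=lambda item: (-item[1], item[0]),  # largest first, name as tie-break
    )
    bins: List[List[str]] = []
    bin_totals: List[int] = []
    for name, size in small:
        for i, total in enumerate(bin_totals):
            if total + size <= target_size:
                bins[i].append(name)
                bin_totals[i] += size
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([name])
            bin_totals.append(size)
    # Rewriting a single file on its own would not reduce the file count.
    return [group for group in bins if len(group) > 1]


# Sizes in MB; names are made up for the example.
files = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 128, "f": 10}
groups = plan_compaction(files, target_size=128)
```

With these inputs, five small files would be rewritten as two larger ones, and the file that already meets the target size is not touched.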