Designing a Data Warehouse

A Data Warehouse (DW) is a collection of data oriented to a certain integrated field, non-volatile and variable in time, that helps decision making. It is used to generate reports and dashboards as well as for data analysis and is a fundamental component in Business Intelligence (BI).

A DW is responsible for extracting data from the operational databases or external sources, transforming, consolidating, integrating, checking integrity and centralizing the data that the company generates in its daily business activity and/or external information with which it is related. The particularities of the DW make that the data models and the design strategies used for operational databases, generally do not serve for their design. This implies that new techniques and design strategies are used.

DW can grow and the need arises to divide the information into data marts. They are usually departmental solutions and store a specific business area.

The objectives of a DW are:

  1. Accessibility of information
  2. Consistency of information between business units
  3. Assistance in decision-making

Among its characteristics, we can mention that it manages large amounts of data, keeps data history, concentrates, adds, and integrates information from many sources.

Due to its objective, it is necessary to change the traditional entity-relationship models in operational bases as they are not adequate to obtain accumulated and historical data. A series of ETL (Extract, Transform, Load) processes are usually performed to obtain a multidimensional model and thus to perform analytical queries in a more optimal way. These queries are normally performed on an essential fact from a series of parameters. For example, sales with a number of variables such as time, customer and product.

From the conceptual model, it is possible to begin to consider how the functional architecture of the DW will be. For this, it´s necessary to contemplate the existence of an intermediate layer between the source systems and the DW, which will serve as a layer of abstraction between the world of processes and the analytical world, allowing optimization of integration with the definitive model.

It is necessary to define 2 bases:

  • Staging: it is the intermediate layer that will serve as storage between the source systems and the DW. Its mission is to offer the possibility of managing the data with the format and structure of origin to facilitate the integration and transformation towards the denormalized model that owns the DW.
  • Data Warehouse: is the database where data extracted from various sources are consolidated.

The DW requires the definition of a logical and physical data model. The logical model is rather descriptive, where the objects that will conform the physical model (tables, fields and relations) are defined, although without entering into the technical details of the data type or restrictions. The physical model is about low-level data structures whose implementation has been taking into account the database engine itself. The physical model translates that logical model, physically determining how it is going to be (integer, decimal, etc.).

This type of data model consists mainly of two types of elements:

  • Dimensions: represent factors by which a particular area of business is analyzed. They are small and usually denormalized.
  • Fact: they are the object of the analyzes and are related to the dimensions. They are very large boards and are usually denormalized. They can include different aggregations as maximum, minimum, average, etc.

If the information needs to have several levels of granularity, hierarchies with dimensions are created. For example, the date hierarchy could be day-week-month-year. Dimensional hierarchies have n-1 relationships so that a value of one level can only be grouped by a single value of each next higher level. This makes it easier to drill-down, decrease detail (roll-up), select (say), slice or pivot in the dimensions (pivot), which are typical of reports obtained at from the DW. A multidimensional model that does not have hierarchies is called a star model, if it had them, it is called a snowflake model.

Menu