Data Warehouse & Business Intelligence: August 2011

Tuesday, August 30, 2011

Fact Vs Dimension tables

Fact table contains events or facts in a business. A fact contains measurable values. We do analysis of facts or events
based on the query fired by end users. For example sales in a business process. so every sale is an event. So in
fact table we have fact or record for each and every event.
Dimension table contains factors by which analysis is done. For example product id is a foreign key in fact table and primary key
for a dimension table called
product. So in product table we have more information about a product which are useful facts for analyzing facts and events.
So it provides descriptive or context information for performing analysis. Fact table is in center surrounded by
dimension tables। Surrogate key of every dimension table is also kept in fact table as foreign key to define many to many relationship between ant two dimensional tables.

Surrogate Key

It's a system generated key which is not actual defined primary but work as a primary key in a table.For example in employee table we have employee number as an actual primary key, as it uniquely identifies all records in employee table. But rownum is system generated key, Oracle application generates by default one row number for each record in relational table. This row num can be used as primary key. In Informatica ETL tool we use sequence generator to generates sequences (surrogate keys). For example let's relational table has one record for employee number 101. We have one field like address for this column. Suppose tomorrow employee has three addresses. We can not increase number of fields for each record every time as it will cost. Now system will not allow for having 3 records for same emloyee id(Primary key violation). So we will generate 3 sequences for these three reocrds. Primary key constraint will be set on sequece no rather than on employee id. It's usual to have surrugate key data type as decimal. it's so because we can generate any number of desired sequences by incrementing (Arthmetic operation) existing one.

Saturday, August 27, 2011

Galaxy schema

Galaxy schema contains many fact tables with some common dimensions (conformed dimensions)। This schema is a combination of many data marts. One fact table may represent one data mart।
It is also known as fact constellation schema। It is a combination of star schema and snow flack schema. As discussed earlier it consists of conformed dimensions.
The dimensions in this schema are segregated into independent dimensions based on the levels of hierarchy. For example, if geography has four levels of hierarchy like region, country, state and city; constellation schema would have four dimensions instead of one.

As we see in the two star schemas as below, the two fact tables, sales table and purchase table are now sharing both the 'product' and 'time' dimension tables. Therefore we decide to choose Galaxy Schema as the model for our data warehouse, which is displayed as follow:

Star schema for sales

As per galaxy schema:

Friday, August 26, 2011

Various categories of Dimensions:

Though reports are generated on fact tables, data in dimension tables is used to describe the data in fact tables।
Dimensions are mainly below types.

Slowly Changing Dimension:

Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.For example, you may have a dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another. How do you record such a change in your sales dimension?You could sum or average the sales by salesperson, but if you use that to compare the performance of salesmen, that might give misleading information. If the salesperson that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, her totals will look much stronger than the other salespeople in her new region, even if they are just as good.

Conformed Dimension:

At the most basic level, conformed dimensions mean the exact same thing with every possible fact table to which they are joined. The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.
These dimensions are something that is built once in your model and can be reused multiple times with different fact tables. For example consider a model containing multiple fact tables representing different data marts. Now look for a dimension that is common to these facts tables. In this example let’s consider that the product dimension is common and hence can be reused by joining the different fact tables.

Degenerate Dimension:

The data items thar are not facts and data items that do not fit into the existing dimensions are termed as Degenerate Dimensions।They can be used as primary key for the fact table but they cannot act as foreign keys।
A column of the key section of the fact table that does not have the associated dimension table but used for reporting and analysis such column is called degenerate dimension or line item dimension।
For example we have a fact table with customer_id, product_id, branch_id, employee_id, bill_no, date in key section and price, quantity, amount in measure section. In this fact table bill_no from key section is a single value it has no associated dimension table. Instead of cteating a seperate dimension table for that single value we can
include it in fact table to improve performance.SO here the column bill_no is a degenerate dimension or line item dimension.
A Degenerate dimension is a Dimension which has only a single attribute.
This dimension is typically represented as a single field in a fact table.
When the cardinality of column value is high instead of maintaining a separate dimension and having the overhead of making a join with fact table degenerated dimensions can be build.
For example In sales fact table Invoice number is a degenerated dimension. Since Invoice Number is not tied up to an order header table hence there is no need for invoice number to join a dimensional table; hence it is referred as degenerate dimension.

Junk Dimension:

When you consolidate lots of small dimensions and instead of having 100s of small dimensions that will have few records in them all records from all these small dimension tables are loaded into ONE dimension table and we call this dimension table Junk dimension table। (Since we are storing all the junk in this one table) .

Inferred Dimension:

Inferred dimensions are the dimensions that are not retrieved when the fact data that are related to them, is ready to insert into the fact table।

Role Playing Dimension:

Dimensions are often recycled for multiple applications within the same database. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Date of Hire". This is often referred to as a "role-playing dimension".

Dirty dimension:

A dimension which can exits more than once in dimension table.

Seeing locks on table

select a.session_id,a.oracle_username, a.os_user_name, b.owner "OBJECT OWNER", b.object_name,b.object_type,a.locked_mode from
(select object_id, SESSION_ID, ORACLE_USERNAME, OS_USER_NAME, LOCKED_MODE from v$locked_object) a,
(select object_id, owner, object_name,object_type from dba_objects) b, v$session v
where a.object_id=b.object_id and a.SESSION_ID=v.sid

Thursday, August 25, 2011

Bitmap , Btree and B+ tree indexes:

B-Tree indexes (Normal)

By default, Oracle creates B-tree indexes As the name suggests it makes use of the B-Tree parsing method to find the value being searched for. When it gets a value to be searched, it gets the nearest possible index value. Then it goes to the left node if the value it lesser or to the right if the value is higher. It goes on doing this until it reaches the leaf node which contains the value it is searching for.

How to Create an Index?

CREATE INDEX IDX_NAME
ON TABLE_NAME (Col_nam1,Col_nam2, )
COMPUTE STATISTICS ; --optional

How to Rename an Index?

ALTER INDEX IDX_NAME
RENAME TO NEW_IDX_NAME;
How to Collect Statistics on an Index?

ALTER INDEX IDX_NAME

COMPUTE STATISTICS;
How to Drop an Index?

DROP INDEX IDX_NAME;

Bitmap indexes

Oracle Bitmap Indexes are different from Standard B-Tree Indexes in the way they store the information about the indexes. Bitmap Indexes stores the information about the value in bits; with 0s and 1s; thus called Bitmaps. Something like this:

Bitmap Index creates a two-dimensional array with one column for every row in the table being indexed. If the value being indexed is found in the row, it will specify a 1 and if not found then a 0 . So while searching, Oracle takes only the information from those rows which shows a 1 in the Bitmap Index. That means, the records which are pointed by the 1 s in the Index, have got the searched/indexed value.

How to Create a Bitmap Index?

CREATE BITMAP INDEX IDX_NAME
ON TABLE_NAME (Col_nam1,Col_nam2, )
COMPUTE STATISTICS ;

DML Operations:

If we have too much DML operations (insert/update/delete) on a table on which the columns are indexed, it is better to use B-Tree indexes।Updating bitmapped indexes consumes lot of resources. So, it is better to use Bitmap Indexes on columns of Tables which are updated/inserted/deleted less frequently .

OLAP system and hence for Star scema: Bit map index is more useful.

OLTP and ODS system B tree index is more usefull.

Performance:

Compressed structure in Bitmap Indexes makes it faster to read from disk; but utilises more CPU to decompress the information.

Uncompressed structure in B-Tree involves a lesser CPU load compared to Bitmap but it requires more bandwidth to read since it is uncompressed data.

Cardinality:

Cardinality means, the ratio of distinct values to the total number of records. So, a low cardinality means, distinct values are very less.
In such a scenario, Bitmap indexes are considered to be more efficient thatB-Tree indexes.

ex indexing on primary key means lot of distinct values. hence after decompress also it will not take considerable less space so searching time will not be much lesser compared to B tree index.

Indexing on non primary key like gender will have less distinct values and hence Bitmap index will help here.

B+ tree indexes:

A B+-tree in certain aspects is a generalization of a binary search tree (BST). The main difference is that nodes of a B+-tree will point to many children nodes rather than being limited to only two. Since our goal is to minimize disk accesses whenever we are trying to locate records, we want to make the height of the multiway search tree as small as possible. This goal is achieved by having the tree branch in large amounts at each node.
The B+-tree stores records (or pointers to actual records) only at the leaf nodes, which are all found at the same level in the tree, so the tree is always height balanced.
Internal nodes store search key values, and are used only as placeholders to guide the search.

Materialized View

A materialized view is a table that stores derived data (Data from main table for which view is created)

During its creation, we specify the SQL used to populate the materialized view.

For example, if you have a large SALES table, we may create a materialized view that sums the SALES data by region. If a user queries the SALES table for the sum of the SALES data for a region, ORACLE can redirect that query to use your materialized view in place of the SALES table.
As a result, we can reduce the number of accesses against our largest tables, improving the system performance.

Example:

create materialized view SALES_MONTH_MV
tablespace AGG_DATA
refresh complete
start with sysdate
next sysdate+1
enable query rewrite
as
select Sales_Month, SUM(Amount)
from SALES
group by Sales_Month;

The enable query rewrite clause enables the optimizer to redirect queries of SALES to SALES_MONTH_MV if appropriate.

The tablespace clause tells ORACLE where to store the local table for the materialized view.

The data will be automatically refreshed if you have enabled background job processes (via the JOB_QUEUE_PROCESSES init.ora parameter).

The start with and next clauses tell ORACLE when to schedule refreshes of the data.

In this example, a complete refresh of the view is chosen-each time the view is refreshed, its data will be completely deleted and re-created।

Fast refreshes of materialized views use materialized view logs. A materialized view log is a table stored along with the master table for the materialized view. As rows change in the master table, the changes are written to the materialized view log. During a fast refresh, the changed rows from the master table, as identified via the materialized view log, are sent to the materialized view. If the changes account for less than 25 percent of the rows in the master table, a fast refresh is generally faster than a complete refresh.

When DML changes are made to master table data, Oracle Database stores rows describing those changes in the materialized view log and then uses the materialized view log to refresh materialized views based on the master table. This process is called incremental or fast refresh. Without a materialized view log, Oracle Database must reexecute the materialized view query to refresh the materialized view. This process is called a complete refresh. Usually, a fast refresh takes less time than a complete refresh.

A materialized view log is located in the master database in the same schema as the master table। A master table can have only one materialized view log defined on it. Oracle Database can use this materialized view log to perform fast refreshes for all fast-refreshable materialized views based on the master table.

Example:

CREATE MATERIALIZED VIEW LOG ON customers WITH PRIMARY KEY, ROWID;

We can index the materialized view’s local table to improve the performance of queries against the materialized view.

Depending on the complexity of the materialized view, ORACLE may also create an index on the materialized view’s local table।

To enable a materialized view for query rewrite, all of the master tables for the materialized view must be in the materialized view’s schema, and you must have the QUERY REWRITE system privilege। If the view and the tables are in separate schemas, you must have the GLOBAL QUERY REWRITE system privilege. In general, you should create materialized views in the same schema as the tables on which they are based; otherwise, you will need to manage the permissions and grants required to create and maintain the materialized view.

Example:

CREATE MATERIALIZED VIEW products_mv
REFRESH FAST ON COMMIT
AS SELECT SUM(list_price - min_price), category_id
FROM product_information
GROUP BY category_id;