Problem Summary
Duplicate rows in the source that cause primary key violation errors on the target table are a common issue when running PowerCenter sessions.
You can use a Lookup transformation to find duplicate data in a target based on a key field (or a composite key). This works when comparing source rows to rows already existing in the target table.
However, this technique does not work when the source itself contains duplicate rows that do not yet exist in the target.
You may be tempted to use an uncached Lookup against the target and set the target commit interval to one so that the target table is queried for each row. Because of the way the PowerCenter Server processes rows (as described in the Lookup Strategies section below), this approach does not work. Even if it did, it would be very inefficient and performance would suffer severely.
Lookup Strategies
If you have duplicate rows in your source data, using a Lookup transformation will not reliably detect them.
A cached Lookup transformation caches the target rows at the beginning of the session, so it can detect duplicates that already exist in the target, but not new duplicate rows being loaded in the same session. Turning off caching will not reliably catch duplicates either: it ensures the target is re-queried for each row, but uncommitted rows temporarily stored in buffers are not visible to the Lookup. Setting the commit interval to one also does not work reliably. A database commit takes some milliseconds to complete, and the transformation engine does not pause while it waits - it keeps processing and transforming rows and buffering them in memory. By the time the commit completes, 100 rows may already have been processed; if the 101st row is a duplicate the Lookup will find it, but not if the duplicate was among the first 100 rows. A dynamic lookup cache can be used to overcome this limitation.
The following examples demonstrate some of the most efficient ways to deal with duplicate source data.
Rank() Function in Source Qualifier SQL
Eliminate duplicates at the time the data is read from the source.
Alternatively, if you want to keep the duplicate entries as older versions of the records, pull all the rows with their respective ranks, then use a Router transformation: route rank = 1 rows through your normal logic and load rank != 1 rows as older versions of the records. A sketch of such a SQL override follows.
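As an illustrative sketch only, the Source Qualifier SQL override can assign a rank per key with an analytic function. The table EMPLOYEE_STG and the columns EMP_ID, EMP_NAME, and LAST_UPDATED are hypothetical placeholders; substitute your own key and ordering columns. ROW_NUMBER() is used rather than RANK() so that exactly one row per key receives rank 1 even when the ordering column ties.

-- Hypothetical Source Qualifier SQL override; table and column names are placeholders.
-- Each source row gets a rank within its key group so the mapping can route on it.
SELECT
    EMP_ID,
    EMP_NAME,
    LAST_UPDATED,
    ROW_NUMBER() OVER (
        PARTITION BY EMP_ID            -- the key (or composite key) that defines a duplicate
        ORDER BY LAST_UPDATED DESC     -- the newest row becomes rank 1
    ) AS DUP_RANK
FROM EMPLOYEE_STG

In the mapping, a Router transformation can send DUP_RANK = 1 rows through the normal load path and DUP_RANK != 1 rows to the older-version path. To drop duplicates entirely at read time instead, wrap this query in an outer SELECT that keeps only DUP_RANK = 1.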
Aggregator Mapping - essentially the same as above, but implemented within Informatica (a rough SQL equivalent is sketched after the pros and cons below).
When to use: any time an Aggregator transformation has no major impact on performance, the source is relational, or the source is a file and output row order does not matter.
Pros: Ease of implementation. Works with all types of sources.
Cons: Limited choice of which row is output. The Aggregator cache can slow performance of large loads. Duplicates are silently removed, so there is no audit trail.
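For reference, grouping on the key ports in an Aggregator behaves much like the GROUP BY sketch below (again with hypothetical table and column names). This also illustrates the first con: you can only choose per-column values, or the last row read per group, rather than an arbitrary surviving row.

-- Rough SQL equivalent of an Aggregator grouping on EMP_ID.
-- MAX() stands in for keeping one value per column per key group.
SELECT
    EMP_ID,
    MAX(EMP_NAME)     AS EMP_NAME,
    MAX(LAST_UPDATED) AS LAST_UPDATED
FROM EMPLOYEE_STG
GROUP BY EMP_ID

Unlike the ranked approach, the non-surviving rows are simply discarded, which is why this method provides no audit of what was removed.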