Chitika

Monday, 2 April 2012

Scenario13

source has 2 fields like

COMPANY             LOCATION
IBM                  HYD
TCS                  BAN
IBM                  CHE
HCL                  HYD
TCS                  CHE
IBM                  BAN
HCL                  BAN
HCL                  CHE

The output should list each company once, with its locations grouped under it and a count of locations per company:

COMPANY   LOC   COUNT

IBM       HYD   3
          CHE
          BAN
TCS       BAN   2
          CHE
HCL       HYD   3
          BAN
          CHE
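In a DataStage job, one common approach is a Sort stage followed by an Aggregator (for the count) joined back to the data, with a Transformer to blank out the repeated company name. The grouping logic itself can be sketched in Python (the input data is hard-coded here purely for illustration):

```python
from collections import defaultdict

# Stand-in for the two-column source, matching the scenario data.
rows = [
    ("IBM", "HYD"), ("TCS", "BAN"), ("IBM", "CHE"), ("HCL", "HYD"),
    ("TCS", "CHE"), ("IBM", "BAN"), ("HCL", "BAN"), ("HCL", "CHE"),
]

# Group locations by company, preserving first-seen company order.
groups = defaultdict(list)
for company, location in rows:
    groups[company].append(location)

# Print the company and count only on the first row of each group;
# subsequent rows show just the location.
for company, locations in groups.items():
    for i, loc in enumerate(locations):
        if i == 0:
            print(f"{company:<10}{loc:<6}{len(locations)}")
        else:
            print(f"{'':<10}{loc}")
```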

Monday, 12 March 2012

Performance Tuning Techniques in DataStage

Apply filter conditions at the source level to avoid unnecessary data flow through the job design.
Extract only the required columns from the source.
Prefer a larger number of DataStage jobs with fewer stages each, rather than a few jobs with many stages.
Use the Data Set stage to store temporary data.
Avoid the Transformer stage where possible, because it requires an additional C++ compiler to build the ETL program.
Instead of a Transformer stage, use the Copy stage to drop unwanted columns or to send data to multiple stages.
Use the Filter stage to apply conditions.
Use the Surrogate Key Generator stage to generate sequence numbers.
Prefer database sorts over DataStage sorts.
Use a Transformer stage to avoid an Aggregator stage where possible.

Saturday, 3 March 2012

Some useful tips on how to debug parallel jobs in DataStage.



Enable the following environment variables in DataStage Administrator:
* APT_PM_PLAYER_TIMING – shows how much CPU time each stage uses
* APT_PM_SHOW_PIDS – shows the process ID of each stage
* APT_RECORD_COUNTS – shows record counts in the log
* APT_CONFIG_FILE – switches the configuration file (one node, multiple nodes)
* OSH_DUMP – shows the OSH code for your job, revealing any unexpected settings made by the GUI
* APT_DISABLE_COMBINATION – does not combine multiple stages into one process. Disabling combination makes it easier to see where your errors are occurring.
* APT_DUMP_SCORE – shows all processes and inserted operators in your job
Use a Copy stage to dump data out to intermediate Peek stages or sequential debug files. Copy stages are removed at compile time, so they add no overhead.
Use a Row Generator stage to generate sample data.
Look in the phantom files for additional error messages: c:\datastage\project_folder\&PH&
To catch partitioning problems, run your job with a single-node configuration file and compare the output with your multi-node run. You can simply compare file sizes, or sort the data for a more detailed comparison (Unix sort + diff commands).
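To illustrate the single-node tip, here is a minimal sketch of a one-node configuration file to point APT_CONFIG_FILE at. The file path, node name, and resource directories are assumptions; adjust them to your installation:

```shell
# Write a minimal one-node APT configuration file (hypothetical paths).
cat > /tmp/one_node.apt <<'EOF'
{
  node "node1"
  {
    fastname "localhost"
    pools ""
    resource disk "/tmp/ds/data" {pools ""}
    resource scratchdisk "/tmp/ds/scratch" {pools ""}
  }
}
EOF

# Point the parallel engine at it for the next job run.
export APT_CONFIG_FILE=/tmp/one_node.apt
echo "Using config: $APT_CONFIG_FILE"
```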

What are the Main Features in DataStage


DataStage has the following features to aid the design and processing required to
build a data warehouse:
• Uses graphical design tools. With simple point-and-click techniques you can draw a scheme to represent your processing requirements.
• Extracts data from any number and type of databases.
• Handles all the meta data definitions required to define your data warehouse. You can view and modify the table definitions at any point during the design of your application.
• Aggregates data. You can modify the SQL SELECT statements used to extract data.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms.
• Loads the data warehouse.

Tuesday, 28 February 2012

Scenario12

A flat file has 5 columns. How do you read only the 2nd and 5th columns?
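In a DataStage job this is typically handled at the Sequential File stage (reading all five columns and dropping the rest with a Modify or Copy stage, or defining a schema that skips the unwanted fields). The column-selection logic can be sketched in Python over a comma-delimited file (the in-memory data and delimiter are assumptions for illustration):

```python
import csv
import io

# Stand-in for the 5-column flat file; in practice you would open the real file.
flat_file = io.StringIO(
    "r1c1,r1c2,r1c3,r1c4,r1c5\n"
    "r2c1,r2c2,r2c3,r2c4,r2c5\n"
    "r3c1,r3c2,r3c3,r3c4,r3c5\n"
)

# Keep only the 2nd and 5th columns (zero-based indices 1 and 4).
selected = [(row[1], row[4]) for row in csv.reader(flat_file)]

for second, fifth in selected:
    print(second, fifth)
```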

Scenario11

What is the difference between Trim, Trim B, Trim L, Trim T, Trim F, Trim Leading Trailing, and Strip White Spaces?
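The usual descriptions of these Transformer string functions can be approximated in Python as below. This is a sketch of the commonly documented behaviour, not the engine itself; check the Transformer function reference for exact semantics:

```python
import re

s = "  some   text  "

# Approximations of the DataStage string functions:
trim_b = s.rstrip(" ")                  # TrimB: remove trailing spaces
trim_f = s.lstrip(" ")                  # TrimF: remove leading spaces
trim_lt = s.strip(" ")                  # TrimLeadingTrailing: both ends
strip_ws = s.replace(" ", "")           # StripWhiteSpace: remove ALL spaces, internal too
trim = re.sub(" +", " ", s.strip(" "))  # Trim: ends removed, internal runs squeezed to one space
```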

Scenario10

How do you abort the job after 50 rows have been processed (assume the flat file contains 500 records)?
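In DataStage this is typically done in the Transformer stage by setting the output link constraint's "Abort After Rows" property to 50. The effect can be sketched in Python (the function and exception names are illustrative, not DataStage APIs):

```python
class JobAbort(Exception):
    """Simulates the job aborting once the row limit is exceeded."""


def process(rows, abort_after=50):
    # Mimics a Transformer link with "Abort After Rows" set: the job
    # stops as soon as the row count exceeds the limit.
    processed = []
    for i, row in enumerate(rows, start=1):
        if i > abort_after:
            raise JobAbort(f"aborting after {abort_after} rows")
        processed.append(row)
    return processed


# With 500 input records the job aborts; with fewer than the limit it completes.
try:
    process(range(500))
except JobAbort as exc:
    print(exc)
```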