Chitika

Monday 12 March 2012

Performance Tuning techniques in datastage

Apply the filter conditions at source level to avoid unnecessary data flow in the job design.
Extract the only required columns from source.
Create large no.of datastage jobs with less no.of stages rather then less no.of jobs with more no.of stages.
Use dataset stage to store temporary data.
Avoid using the transformer stage as much as possible because it needs to have additional C++ compiler to execute the ETL Program.
Insted of using Transformer stage use copy stage to drop the unwanted columns or to send it to multiple stages.
use filter stage to apply conditions.
use surrogate key generator stage to generate sequence number.
prefer database sorts than data stage sorts.
use transformer stage to avoid aggregator stage.

Saturday 3 March 2012

Some useful tips of how to debug parallel jobs in datastage.



Enable the following environment variables in DataStage Administrator:
* APT_PM_PLAYER_TIMING – shows how much CPU time each stage uses
* APT_PM_SHOW_PIDS – show process ID of each stage
* APT_RECORD_COUNTS – shows record counts in log
* APT_CONFIG_FILE – switch configuration file (one node, multiple nodes)
* OSH_DUMP – shows OSH code for your job. Shows if any unexpected settings were set by the GUI.
* APT_DUMP_SCORE – shows all processes and inserted operators in your job
* APT_DISABLE_COMBINATION – do not combine multiple stages in to one process. Disabling this will make it easier to see where your errors are occurring.
Use a Copy stage to dump out data to intermediate peek stages or sequential debug files. Copy stages get removed during compile time so they do not increase overhead.
Use row generator stage to generate sample data.
Look at the phantom files for additional error messages: c:\datastage\project_folder\&PH&
To catch partitioning problems, run your job with a single node configuration file and compare the output with your multi-node run. You can just look at the file size, or sort the data for a more detailed comparison (Unix sort + diff commands).

What are the Main Features in DataStage


DataStage has the following features to aid the design and processing required to
build a data warehouse:
• Uses graphical design tools. With simple point-and-click techniques you
can draw a scheme to represent your processing requirements.
• Extracts data from any number or types of database.
• Handles all the meta data definitions required to define your data warehouse. You can view and modify the table definitions at any point during
the design of your application.
• Aggregates data. You can modify SQL SELECT statements used to extract
data.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms to use.
• Loads the data warehouse