Data Processing and File Management Techniques

The document outlines a data processing task involving splitting records into multiple files based on specified criteria, along with various operations such as joins, sorting, and filtering. It provides answers to common questions related to data manipulation in a specific framework, including handling duplicates, executing graphs, and database configuration. Additionally, it includes examples of extracting data from files and the necessary tags for different database connections.

Uploaded by

divyadara63

Q-1 : Split 11 records into three files. The first record should go to all three files, records
2-5 should go to the 2nd file, and records 6-11 should go to the 3rd file.

Graph 1 :

/* Function returning a vector of output port indexes */
out::output_indexes(in)=
begin
  let decimal(",") count=next_in_sequence();
  out:1: if(count == 1) [vector 0,1,2];
  out:2: if((count >1) && (count <6)) [vector 1];
  out:: if(count >=6 ) [vector 2];
end;
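
The routing rule in Graph 1 can be checked outside the graph. The following is a minimal Python sketch of the same logic, not Ab Initio code; the names `output_indexes` and `split_records` are hypothetical, and the sequence number is assumed to start at 1 as `next_in_sequence()` does in the transform above.

```python
def output_indexes(count):
    """Return zero-based output indexes for the record with 1-based sequence number `count`."""
    if count == 1:
        return [0, 1, 2]  # first record goes to every output
    if 1 < count < 6:
        return [1]        # records 2-5 go to the second output
    return [2]            # records 6 and later go to the third output


def split_records(records):
    """Distribute records over three output lists according to output_indexes()."""
    outputs = [[], [], []]
    for count, record in enumerate(records, start=1):
        for index in output_indexes(count):
            outputs[index].append(record)
    return outputs


outputs = split_records(list(range(1, 12)))  # 11 records numbered 1..11
```

With 11 input records, the first output holds only record 1, the second holds records 1-5, and the third holds record 1 plus records 6-11.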

Graph 2 :

Reformat :-

/*Reformat operation*/
out::reformat(in)=
begin
[Link]::next_in_sequence();
out.* ::in.*;
end;

Write Multi File : -

/* This type is optional.*/


// type output_type = record
// string("\n") parameter_value ;
// end;

/*extract filename from input record*/


filename::get_filename(in)=
begin
let decimal(",") count=next_in_sequence();
filename:: if (count == 1) "/home/mb76842/file1"
else if (count >1 && count <6) "/home/mb76842/file2"
else if (count >5) "/home/mb76842/file3";
end;

/* This function is optional. */


/*Create output record*/
write::reformat(in)=
begin
[Link] :: [Link];
write.parameter_value :: in.parameter_value;
end;
type output_type=record
decimal(",") value=0;
string("\n") parameter_value="" ;
end; /*Metadata for records written to output files*/

Join
Inner joins

The most common case is when join-type is Inner Join. In this case, if each input port
contains a record with the same value for the key fields, the transform function is called
and an output record is produced.

If some of the input flows have more than one record with that key value, the transform
function is called multiple times, once for each possible combination of records, taken
one from each input port.

Whenever a particular key value does not have a matching record on every input port and
Inner Join is specified, the transform function is not called and all incoming records with
that key value are sent to the unusedn ports (unused0, unused1, and so on).
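
The inner-join behaviour described above (one transform call per combination of matching records, unmatched records routed to the unused ports) can be sketched in Python. This is an illustration under assumptions, not the JOIN component itself: two inputs only, records modelled as dicts, and the function name `inner_join` is hypothetical.

```python
from collections import defaultdict
from itertools import product


def inner_join(in0, in1, key, transform):
    """Join two record lists on `key`; return (out, unused0, unused1)."""
    groups0, groups1 = defaultdict(list), defaultdict(list)
    for rec in in0:
        groups0[rec[key]].append(rec)
    for rec in in1:
        groups1[rec[key]].append(rec)

    out, unused0, unused1 = [], [], []
    for k in sorted(set(groups0) | set(groups1)):
        if k in groups0 and k in groups1:
            # One transform call per combination, one record taken from each input.
            out.extend(transform(a, b) for a, b in product(groups0[k], groups1[k]))
        else:
            # No match on every input: the transform is not called.
            unused0.extend(groups0.get(k, []))
            unused1.extend(groups1.get(k, []))
    return out, unused0, unused1


in0 = [{"id": 1, "a": "x"}, {"id": 1, "a": "y"}, {"id": 2, "a": "z"}]
in1 = [{"id": 1, "b": "p"}, {"id": 3, "b": "q"}]
out, unused0, unused1 = inner_join(in0, in1, "id", lambda a, b: {**a, **b})
```

Key 1 appears twice on the first input and once on the second, so the transform is called twice for it; keys 2 and 3 each lack a partner and end up in the unused lists.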

Full outer joins

Another common case is when join-type is Full Outer Join: if each input port has a
record with a matching key value, JOIN does the same thing it does for an inner join.
If some input ports do not have records with matching key values, JOIN applies the
transform function anyway, with NULL substituted for the missing records. The missing
records are in effect ignored.

With an outer join, the transform function typically requires additional rules (as
compared to an inner join) to handle the possibility of NULL inputs.

About explicit joins

The final case is when join-type is Explicit. This setting allows you to specify True or
False for the record-requiredn parameter of each inn port (record-required0 for in0,
record-required1 for in1, and so on). The settings you choose determine when JOIN calls
the transform function. See record-requiredn.
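
How the record-required settings decide whether the transform is called can be sketched as follows. This is a simplified two-input Python model, not the component's implementation. It assumes that True/True behaves like an inner join and False/False like a full outer join, with `None` standing in for NULL; the function name `explicit_join` is hypothetical.

```python
from collections import defaultdict
from itertools import product


def explicit_join(in0, in1, key, transform, record_required0, record_required1):
    """Call transform for each key unless a required input has no record for it."""
    groups0, groups1 = defaultdict(list), defaultdict(list)
    for rec in in0:
        groups0[rec[key]].append(rec)
    for rec in in1:
        groups1[rec[key]].append(rec)

    out = []
    for k in sorted(set(groups0) | set(groups1)):
        side0 = groups0.get(k)  # None when in0 has no record for this key
        side1 = groups1.get(k)  # None when in1 has no record for this key
        if (side0 is None and record_required0) or (side1 is None and record_required1):
            continue  # a required input is missing, so the transform is not called
        # A missing optional input is passed to the transform as None (NULL).
        out.extend(transform(a, b) for a, b in product(side0 or [None], side1 or [None]))
    return out


def pair(a, b):
    return (a["v"] if a else None, b["v"] if b else None)


in0 = [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}]
in1 = [{"k": 2, "v": "c"}, {"k": 3, "v": "d"}]
inner = explicit_join(in0, in1, "k", pair, True, True)
full_outer = explicit_join(in0, in1, "k", pair, False, False)
left_outer = explicit_join(in0, in1, "k", pair, True, False)
```

The `pair` transform shows the extra NULL handling that an outer join needs: it checks each input before reading a field from it.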

How can we delete duplicate records by using a ROLLUP component in Ab Initio?
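
The notes leave this question unanswered. The general idea of deduplicating by grouping on a key and emitting one record per group can be sketched in Python as below. It is an assumption that the first record of each group is the one to keep; in a real ROLLUP the transform decides which values survive. The name `rollup_dedup` is hypothetical.

```python
def rollup_dedup(records, key):
    """Keep the first record seen for each key value, preserving input order."""
    first_by_key = {}
    for rec in records:
        # setdefault stores the record only if this key has not been seen yet.
        first_by_key.setdefault(rec[key], rec)
    return list(first_by_key.values())


rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
deduped = rollup_dedup(rows, "id")
```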

Miscellaneous
1> Suppose there is 1 GB of data and max_core in the Sort component is set to 100 MB.
Will the data be sorted? If yes, how, and when and where will the temp files be created?

Ans: Yes, the data will be sorted. The process will create 3 * 1 GB (taken as 1000 MB) /
100 MB = 30 temporary files on disk to sort the records.
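
The arithmetic in the answer can be written as a small helper. This only restates the rule of thumb given above (3 * data size / max_core); it is not a verified description of how the Sort component allocates its temporary files, and the name `temp_file_count` is hypothetical.

```python
def temp_file_count(data_mb, max_core_mb):
    """Rule of thumb from the answer above: 3 * data size / max_core."""
    return 3 * data_mb // max_core_mb


files_needed = temp_file_count(1000, 100)  # 1 GB taken as 1000 MB, max_core = 100 MB
```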

2> An input file is processed using a lookup function. For a particular input value there
are 10 matching records in the lookup file. Which record from the file will the process
pick?

Ans: The first matching record in the lookup file.
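
A first-match lookup of the kind described in the answer can be sketched in Python. The function below is a hypothetical stand-in, not the Ab Initio lookup function; it scans the lookup records in file order and returns the first one whose key matches.

```python
def lookup(lookup_records, key_field, value):
    """Return the first record whose key_field equals value, or None if none match."""
    for rec in lookup_records:
        if rec[key_field] == value:
            return rec
    return None


lookup_file = [
    {"code": "A", "desc": "first A"},
    {"code": "B", "desc": "only B"},
    {"code": "A", "desc": "second A"},
]
match = lookup(lookup_file, "code", "A")
```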

3> A file has one field "EMPNO" with the values (0,1,2,3,4,0,1,2,3,4,0,1,2,3,4). The data is
processed through a Partition by Expression component with the expression "empno*1". What
will the output be?

4> There are 100 records in a file. How can the total number of records in the file be
calculated using REFORMAT instead of ROLLUP?

Ans: By using global variables.
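
The global-variable approach can be illustrated with a Python analogue: a counter that lives outside the per-record function and is incremented on every call. This is a sketch of the idea only; the names `record_count` and `reformat` are hypothetical and do not correspond to DML syntax.

```python
record_count = 0  # global variable shared across all calls


def reformat(record):
    """Pass the record through and bump the global counter once per record."""
    global record_count
    record_count += 1
    return {**record, "running_count": record_count}


outputs = [reformat({"id": i}) for i in range(100)]  # 100 input records
```

After the last record has been processed, `record_count` holds the total, and the last output record carries the same value in `running_count`.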

5> What is the order of execution of a graph when it runs?

6> A serial input file is followed by a Partition by Round-robin, then a Reformat, then a
serial output file. Will this graph execute?
Ans: Yes, it will run.

7> If a few records have NULL as the join field value, how will the Join component behave?
Ans: The records with NULL in the join field will not appear in the output.

8> If we do not use a Sort component before Dedup, how will it behave?
Ans: It will behave properly.

9> Which function is used to convert a string into a decimal?
Ans: Just type-cast the string value as (decimal(",")) <string_value> ;

10> How can a 4-way MFS be converted to 8-way and vice versa?

Ans: Input File (4-way) -> Partition by Key -> Merge -> Output (8-way).

It can also be done by repartitioning with the Partition by Key and Sort components.
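
Repartitioning from one degree of parallelism to another can be modelled in Python as: gather the records from the existing partitions, then redistribute them by key over the new number of partitions. This is a simplified sketch, assuming integer keys and using the key modulo the partition count in place of whatever hash Partition by Key applies; the function names are hypothetical.

```python
def partition_by_key(records, key, ways):
    """Distribute records over `ways` partitions using the key modulo `ways`."""
    partitions = [[] for _ in range(ways)]
    for rec in records:
        partitions[rec[key] % ways].append(rec)
    return partitions


def repartition(partitions, key, new_ways):
    """Gather all partitions into one stream and redistribute over `new_ways` partitions."""
    gathered = [rec for part in partitions for rec in part]
    return partition_by_key(gathered, key, new_ways)


four_way = partition_by_key([{"id": i} for i in range(16)], "id", 4)
eight_way = repartition(four_way, "id", 8)
back_to_four = repartition(eight_way, "id", 4)
```

The same function handles both directions, since only the target partition count changes.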

11> The input dataset has 100 records. I want the records between 50 and 75, and I do not
want the 5th record. Which component should I use?
Ans: Filter by Expression or Reformat.
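
The selection condition can be sketched in Python as a filter on the 1-based record number. The name `select_range` is hypothetical. Note that the 5th record already falls outside the 50-75 range, so in this sketch no separate exclusion is needed.

```python
def select_range(records, first=50, last=75):
    """Keep records whose 1-based position lies between first and last, inclusive."""
    return [rec for n, rec in enumerate(records, start=1) if first <= n <= last]


selected = select_range(list(range(1, 101)))  # records numbered 1..100
```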

12> How can the last record in a file be extracted? What will the graph be? What will the
sort key be?
Ans: Input File -> Reformat (add a new field C = next_in_sequence()) -> Sort (descending on C)
-> Reformat (select next_in_sequence() == 1, revert to the original DML) -> Output.
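
The number, sort descending, take-first pipeline in the answer can be mirrored step by step in Python. This is a sketch of the same idea rather than the graph itself; `last_record` is a hypothetical name.

```python
def last_record(records):
    """Number the records, sort by that number descending, and return the first one."""
    numbered = [(n, rec) for n, rec in enumerate(records, start=1)]  # add field C
    numbered.sort(key=lambda pair: pair[0], reverse=True)            # sort descending on C
    return numbered[0][1] if numbered else None                      # keep record 1, drop C


result = last_record(["first", "middle", "last"])
```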

13> In the dept table I have deptno = 10, 20, 30, 40, 50. I want dept = 10 in one file,
dept = 20 in a 2nd file and the rest in a 3rd file. What will the simplest graph be?

14> Suppose I have 3 keys in a table. What will the result be when I pass it through ROLLUP?
Will it be a combination of the 3 keys, or something else?
Ans: Yes, it will be a combination of the 3 fields.

15> I have a graph running. Suddenly the power fails. When I restart, what does the GDE show?

16> I have a Replicate connected to a Join. Will the graph execute?

17> Give a scenario where Partition by load balance should be used.

18> Which variable contains the graph return value?

Ans: mpjret

19> How can we range an input value?
Find Split …

1: Extract the third line of a file.

awk 'NR==3' <file_name>

2: Extract the 4th character of the third line in a file.

head -3 file | tail -1 | cut -c4

3: Extract the 4th field of the third line in a file (where the delimiter is ",").

head -3 file | tail -1 | awk -F"," '{ print $4 }'
sed -n '3 p' filename | awk -F"," '{ print $4 }'
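
For comparison, the three extractions above can be expressed in Python. This is a sketch that works on the file contents as a string so that it stays self-contained; the function names are hypothetical, and it assumes the file has at least three lines and the third line has at least four fields.

```python
def third_line(text):
    """Equivalent of awk 'NR==3': the third line, without its newline."""
    return text.splitlines()[2]


def fourth_char_of_third_line(text):
    """Equivalent of head -3 | tail -1 | cut -c4."""
    return third_line(text)[3]


def fourth_field_of_third_line(text, delimiter=","):
    """Equivalent of head -3 | tail -1 | awk -F"," '{ print $4 }'."""
    return third_line(text).split(delimiter)[3]


sample = "a,b,c,d\ne,f,g,h\ni,j,k,l\nm,n,o,p\n"
line3 = third_line(sample)
char4 = fourth_char_of_third_line(sample)
field4 = fourth_field_of_third_line(sample)
```

In the sample data the 4th character of the third line is a comma, because the delimiter counts as a character for `cut -c4` just as it does for string indexing.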

4: How can we check the status of a graph in the end script?


------------------------------------------------------------------------
A. New database connection setting :
1. Select the database, the version number and whether we are connecting remotely or not.

2. Select a database component from the component organizer.

3. Generate the skeleton database configuration file.

Input Table : Config File
Update Table : DBConfig File

4. Configure the database settings in a file and save it as <name>.dbc.

B. Test the database configuration :

From GDE : Select the database file in the Config File or DBConfig File column, then
right-click the Config File button and click Test.
From Command Line : m_db test <database configuration file path>

C. Then select the table name in the source column.

Contents of the .dbc file :

--------------------------------------
The .dbc file contains a series of lines of text in the format below.
tag:value or tag:value1 value2 value3 ... or
tag:value1
tag:value2

A line beginning with # is treated as a comment line in the .dbc file. To have # treated
as a comment character wherever it appears in a line, set the parameter
Ab_IDB_ALLOW_INLINE_COMMENT=true .
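
The line format described above can be made concrete with a small Python parser. This is a hypothetical sketch based only on the description in these notes, not the parser Ab Initio uses. It assumes values are whitespace-separated, that a repeated tag adds to the values already collected, and it models the inline-comment setting as a plain function argument.

```python
def parse_dbc(text, allow_inline_comment=False):
    """Parse tag:value lines into a dict mapping each tag to a list of values."""
    config = {}
    for line in text.splitlines():
        if allow_inline_comment:
            line = line.split("#", 1)[0]  # drop everything after the first '#'
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        tag, _, value = line.partition(":")
        config.setdefault(tag.strip(), []).extend(value.split())
    return config


sample = "# connection settings\ndb_version: 9\ndb_node: host1 host2\ndb_node: host3\n"
parsed = parse_dbc(sample)
inline = parse_dbc("db_name: orcl # instance name", allow_inline_comment=True)
```

The sample values (`9`, `host1`, `orcl`, and so on) are invented for illustration and do not come from a real configuration.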
-----------------------------------------------------------
Oracle Required Tags :

db_version - Oracle version .


db_home - $ORACLE_HOME - Oracle home path .
db_name - $ORACLE_SID - Oracle Instance name .
db_node - Machine on which the Oracle server or client software runs .

Adabas Required Tags :

db_node - Machine on which the Adabas server or client software runs .
adaload : Full pathname of the Adabas load library .

DB2 EE ( Enterprise Edition ) Required Tags :

db_version - DB2 version .
db_home - $DB2INSTANCE
db_name - $DB2DBDFT - DB2 database name .
db_node - Machine(s) on which the DB2 server or client software runs .

DB2 EEE ( Enterprise-Extended Edition ) Required Tags :

db_version - DB2 version .
db_home - $DB2INSTANCE
db_name - $DB2DBDFT - DB2 database name .
db_node - Machine(s) on which the DB2 server or client software runs .

Teradata Required Tags :

odbc_data_source_name - ODBC data source name .
db_name - Teradata TDPID .
db_node - Machine on which the Teradata utilities run .
