Please help me with the following.
I have a matrix file, a data file, and a query file (all shown below). I would like to:
1. Find the average over Rep for the query columns only, grouped by Sample and Loc. The column order is not fixed: Sample is column 1 in the example but may be column 5 in the real data, so the columns should be located dynamically by header keywords such as 'Sample' in the data file. Missing data is indicated by NULL.
If an entry in the query file, such as T4, is not present in the data, that column name can be ignored.
2. Output both the matrix and the data file as separate files, restricted to the samples they have in common and with those samples arranged in the same sequence.
Please note that the columns are in the same order in both output files (Output and Output2 below). The order itself doesn't matter as long as the two files are in sync, e.g. {s2,s1,s3} in both; {s1,s2,s3} in both is also acceptable.
Matrix file (mat.txt):
Code:
S2 S1 S3 S4 S5
G1 11 12 13 14 15
G2 21 22 23 24 25
G3 31 32 33 34 35
G4 41 42 43 44 45
Data file (data.txt):
Code:
Sample Loc Rep T1 T2 T3 RC1 RC2 RC3
S1 L1 1 1.5 NULL 45 R F T
S1 L1 2 2.5 2 NULL 35 F G
S1 L2 1 4 3 NULL F T R
S2 L1 1 56 45 24 F G Y
S2 L2 1 10 5 NULL G F Y
S2 L2 2 20 NULL 34 F G T
S3 L1 1 3.4 NULL 32 F T Y
S3 L2 1 4.6 3 21 D D R
Query file (QUERY.TXT):
Code:
T1
T2
T3
T4
Output
Code:
S2 S1 S3
G1 11 12 13
G2 21 22 23
G3 31 32 33
G4 41 42 43
Output2
Code:
S2 S1 S3
T1_L1 56 2 3.4
T1_L2 15 4 4.6
T2_L1 45 2 NULL
T2_L2 5 3 3
T3_L1 24 45 32
T3_L2 34 NULL 21
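The averages in the expected output skip NULL entries rather than counting them as zero, and a group with no usable values stays NULL. A minimal sketch of that rule, assuming a hypothetical shell helper named avg_skip_null (not part of the original files) that receives the Rep values of one Sample/Loc group as arguments:

```shell
# Hypothetical helper: average the arguments, ignoring NULL entries.
avg_skip_null() {
  printf '%s\n' "$@" | awk '
    NF && $1 != "NULL" { sum += $1; n++ }     # NULL adds nothing to the sum or the count
    END { if (n) print sum / n; else print "NULL" }   # no usable values: result stays NULL
  '
}

avg_skip_null 1.5 2.5    # T1 for S1/L1 -> 2
avg_skip_null NULL 2     # T2 for S1/L1 -> 2
avg_skip_null NULL       # T2 for S3/L1 -> NULL
```

Dividing by the count of non-NULL values, not by the number of Rep rows, is what makes T2_L1 for S1 come out as 2 instead of 1.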
I tried this
Code:
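awk '
# Output names mat_out.txt and data_out.txt are placeholders, not fixed by the task.
FILENAME == "QUERY.TXT" { want[$1]; next }        # requested column names
FILENAME == "data.txt" && FNR == 1 {              # header: locate columns by name
    for (i = 1; i <= NF; i++) {
        if ($i == "Sample") sc = i
        else if ($i == "Loc") lc = i
        else if ($i in want) { qcol[++nq] = i; qname[nq] = $i }   # T4 is absent, so it is ignored
    }
    next
}
FILENAME == "data.txt" {                          # rows: sum per query column, Loc and Sample
    seen[$sc]
    if (!($lc in lseen)) { lseen[$lc]; loc[++nl] = $lc }
    for (j = 1; j <= nq; j++)
        if ($(qcol[j]) != "NULL") { sum[j, $lc, $sc] += $(qcol[j]); cnt[j, $lc, $sc]++ }
    next
}
FNR == 1 {                                        # mat.txt header: keep samples also seen in data.txt
    for (i = 1; i <= NF; i++)
        if ($i in seen) { keep[++nk] = i; name[nk] = $i; hdr = hdr (nk > 1 ? " " : "") $i }
    print hdr > "mat_out.txt"
    print hdr > "data_out.txt"
    next
}
{                                                 # mat.txt rows: row label, then the kept columns
    row = $1                                      # assumes the header has no label, so data is shifted by 1
    for (k = 1; k <= nk; k++) row = row " " $(keep[k] + 1)
    print row > "mat_out.txt"
}
END {                                             # averages: one row per query column and Loc
    for (j = 1; j <= nq; j++)
        for (l = 1; l <= nl; l++) {
            row = qname[j] "_" loc[l]
            for (k = 1; k <= nk; k++) {
                key = j SUBSEP loc[l] SUBSEP name[k]
                row = row " " ((key in cnt) ? sum[key] / cnt[key] : "NULL")
            }
            print row > "data_out.txt"
        }
}
' QUERY.TXT data.txt mat.txt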