Channel: UNIX and Linux Forums

Extract specific contents from each line

Hi all,
Happy new year!
I have a problem extracting specific information from each line of a file on Unix.
My file is a dbSNP flat file. Take these two SNPs as examples:

Code:

REFSNP-DOCSUM-SET (FULL-DUMP)
CREATED ON: 2012-06-08 10:50

rs782 | human | 9606 | snp | genotype=NO | submitterlink=YES | updated 2012-05-24 15:16
ss796 | WIAF | WIAF-4053 | orient=+ | ss_pick=YES
SNP | alleles='A/G' | het=0 | se(het)=0
VAL | validated=NO | min_prob=? | max_prob=? | notwithdrawn
CTG | assembly=GRCh37.p5 | chr=22 | chr-pos=21368027 | NT_011520.12 | ctg-start=758596 | ctg-end=758596 | loctype=2 | orient=-
LOC | MGC16703 | locus_id=113691 | fxn-class=intron-variant | mrna_acc=NR_003608.1
CTG | assembly=HuRef | chr=22 | chr-pos=4636312 | NW_001838740.2 | ctg-start=101457 | ctg-end=101457 | loctype=2 | orient=+

rs783 | human | 9606 | snp | genotype=NO | submitterlink=YES | updated 2012-05-24 15:16
ss797 | WIAF | WIAF-4054 | orient=- | ss_pick=NO
ss142579 | SC | bK747E2_58940 | orient=+ | ss_pick=NO
ss11007317 | BCM_SSAHASNP | chr22.NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss13383292 | SC_SNP | NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss16932069 | CSHL-HAPMAP | CSHL-HuAA-200402.chr22.NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss19502978 | CSHL-HAPMAP | CSHL-HuDD-200402.chr22.NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss21850370 | SSAHASNP | WGSA-200403-chr22.chr22.NT_011520.9_8852191 | orient=+ | ss_pick=NO
ss24092252 | PERLEGEN | afd0119572 | orient=+ | ss_pick=NO
ss44310901 | ABI | hCV483863 | orient=+ | ss_pick=NO
ss65824218 | KRIBB_YJKIM | KHS1 | orient=+ | ss_pick=NO
ss67995054 | ILLUMINA | HumanHap650Yv1.0_rs783 | orient=+ | ss_pick=NO
ss71559615 | ILLUMINA | HumanHap650Yv3.0_rs783 | orient=+ | ss_pick=NO
ss75334879 | ILLUMINA | ILMN_Human_1M_rs783 | orient=+ | ss_pick=NO
ss78462080 | HGSV | Cor12878_SNV_20070510.chr22_27786176 | orient=+ | ss_pick=NO
ss80220046 | HGSV | Cor18507_SNV_20070510.chr22_27786176 | orient=+ | ss_pick=NO
ss84312729 | HGSV | Cor19240_SNV_20070510.chr22_27786176 | orient=+ | ss_pick=NO
ss85606923 | HGSV | Cor19129_SNV_20070510.chr22_27786176 | orient=+ | ss_pick=NO
ss91901591 | BCMHGSC_JDW | JWB-1519492 | orient=+ | ss_pick=NO
ss96097027 | HUMANGENOME_JCVI | 1103691025572 | orient=+ | ss_pick=NO
ss103852070 | BGI | BGI_rs783 | orient=+ | ss_pick=NO
ss112600866 | 1000GENOMES | CEU.trio.12.15.2008_3797684_chr22_27791622 | orient=+ | ss_pick=NO
ss114124839 | 1000GENOMES | NA19240_2008_12_16_3433050_chr22_27791622 | orient=+ | ss_pick=NO
ss117385925 | ILLUMINA-UK | NA18507_000015702_NCBI36.1_chr22_27791622 | orient=+ | ss_pick=NO
ss119336900 | KRIBB_YJKIM | KHS1499147 | orient=+ | ss_pick=NO
ss138345568 | ENSEMBL | ENSSNP11917357 | orient=+ | ss_pick=NO
ss143589615 | ENSEMBL | ENSSNP9790017 | orient=+ | ss_pick=NO
ss157114151 | GMI | GMI_SNP_24747627 | orient=+ | ss_pick=NO
ss167823401 | COMPLETE_GENOMICS | NA07022_36_chr22_27791622 | orient=+ | ss_pick=NO
ss169076235 | COMPLETE_GENOMICS | NA19240_36_chr22_27791622 | orient=+ | ss_pick=NO
ss171908720 | COMPLETE_GENOMICS | NA20431_36_chr22_27791622 | orient=+ | ss_pick=NO
ss174572413 | ILLUMINA | Human1M-Duov3_B_rs783-127_B_R_1502386830 | orient=- | ss_pick=NO
ss204071565 | BUSHMAN | BUSHMAN-chr22-27791621 | orient=+ | ss_pick=NO
ss208816497 | BCM-HGSC-SUB | BCM_CMT_1011-3271932 | orient=+ | ss_pick=NO
ss228652888 | 1000GENOMES | pilot_1_YRI_10462571_chr22_27791622 | orient=+ | ss_pick=NO
ss238048334 | 1000GENOMES | pilot_1_CEU_7652963_chr22_27791622 | orient=+ | ss_pick=NO
ss244172327 | 1000GENOMES | pilot_1_CHB+JPT_6057404_chr22_27791622 | orient=+ | ss_pick=NO
ss283616234 | GMI | GMI_AK_SNP_7936655 | orient=+ | ss_pick=YES
ss292750126 | PJP | SNP_2256484_chr22_27791622 | orient=+ | ss_pick=NO
ss479369913 | ILLUMINA | HumanOmni2.5-4v1_D_kgp10584265-0_B_R_1817221274 | orient=- | ss_pick=NO
ss484300582 | ILLUMINA | HumanOmni2.5-4v1_B_SNP22-27791622-0_B_R_1627743334 | orient=- | ss_pick=NO
SNP | alleles='A/G' | het=0.476714 | se(het)=0.10536
VAL | validated=YES | min_prob=? | max_prob=? | notwithdrawn | byCluster,byFrequency,byOtherPop,by2Hit2Allele,byHapMap
GMAF | allele=G | count=2184 | MAF=0.391
CTG | assembly=GRCh37.p5 | chr=22 | chr-pos=29461622 | NT_011520.12 | ctg-start=8852191 | ctg-end=8852191 | loctype=2 | orient=+

I want to extract the "alleles=XXX" and "allele=XX" fields for each SNP. Can anyone help with this?

To make things more difficult:
1) Not every SNP has both fields. Some SNPs have only "alleles=XXX" and no "allele=XX".
2) The "XXX" values are not limited to A, T, G, C; sometimes they contain "-".

Thanks!
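
For reference, here is one possible way to do this, sketched in Python rather than as the definitive solution. It assumes that each record starts with a line beginning with "rs<number> |", that "alleles='...'" appears on the SNP line, and that "allele=..." appears only on the optional GMAF line, as in the sample above. The function name extract_alleles, the tab-separated output, and the "NA" placeholder for a missing field are illustrative choices, not part of the dbSNP format.

```python
import re
import sys

# alleles='A/G' on the SNP line; the quoted value may contain "-" (e.g. '-/A')
ALLELES_RE = re.compile(r"\balleles='([^']*)'")
# allele=G on the optional GMAF line (does not match "alleles=")
ALLELE_RE = re.compile(r"\ballele=([^\s|]+)")
# assumed record start: a line such as "rs782 | human | ..."
RS_RE = re.compile(r"^(rs\d+)\s*\|")


def extract_alleles(lines):
    """Yield (rs_id, alleles, allele) for each SNP record.

    alleles or allele is None when the record has no such field.
    """
    rs_id = alleles = allele = None
    for line in lines:
        start = RS_RE.match(line)
        if start:
            # A new record begins: emit the previous one, if any.
            if rs_id is not None:
                yield rs_id, alleles, allele
            rs_id, alleles, allele = start.group(1), None, None
            continue
        if rs_id is None:
            continue  # still in the file header
        if line.startswith("SNP"):
            found = ALLELES_RE.search(line)
            if found:
                alleles = found.group(1)
        elif line.startswith("GMAF"):
            found = ALLELE_RE.search(line)
            if found:
                allele = found.group(1)
    # Emit the last record in the file.
    if rs_id is not None:
        yield rs_id, alleles, allele


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract_alleles.py flatfile.txt  (script name is hypothetical)
    with open(sys.argv[1]) as handle:
        for rs_id, alleles, allele in extract_alleles(handle):
            print(rs_id, alleles or "NA", allele or "NA", sep="\t")
```

On the two sample records this should give "A/G" with no allele for rs782 and "A/G" with allele "G" for rs783. Because the patterns capture whatever sits between the quotes or after the "=", values containing "-" are handled the same way as A, T, G, C.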
