Channel: UNIX and Linux Forums

Parse multiple html files in directory

I have downloaded the HTML source of 97 pages using:

Code:

wget -x -i link.txt
and then ran a rename loop:

Code:

# Append .txt to every entry in the current directory
for file in *
do
  mv -- "$file" "$file.txt"
done

This keeps the HTML tags but gives each file a .txt extension so that it can be parsed as plain text.

The number of genes varies across the 97 .txt files, but each gene has, or should have, a corresponding OMIM #. The files are all in:
Code:

C:\Users\cmccabe\Desktop\list\geneticslab.emory.edu.txt\tests_txt
Is there a way to search the source of each file for these gene names and OMIM #s?

For example, in the attached file there are 26 genes:

Desired output (tab-delimited):
Code:

A	B
Gene	OMIM
AKT1	164730
ALK	105590
APC	611731

Each gene name seems to come right after target = '_blank'> in a link, for example:
Code:

target = '_blank'>AKT1</a>
and each OMIM # seems to come after:
Code:

style = 'margin-bottom:10px;'><a href =
I think sed can parse HTML, but I am not familiar enough with it to know how to write this for multiple files in a directory.
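As a starting point, here is a minimal sketch of what such a loop might look like for the gene names only. It is not a tested solution for these files. The function name extract_genes is hypothetical, and the sketch assumes that every gene link ends in exactly target = '_blank'>NAME</a> (same spacing and quotes as in the snippet above), that no other links on the page use that pattern, and that grep supports -o (GNU and BSD grep do). The OMIM # could probably be pulled out the same way once the exact markup that follows <a href = is known.

```shell
#!/bin/sh
# extract_genes DIR
# Hypothetical helper: prints "filename<TAB>gene" for every gene link
# found in the .txt files directly inside DIR.
extract_genes() {
    dir=$1
    for f in "$dir"/*.txt; do
        # Skip the literal pattern when no .txt file matches
        [ -f "$f" ] || continue
        # grep -o prints each match on its own line, so several gene
        # links on one line of HTML are all kept; sed then strips the
        # surrounding markup and leaves only the link text.
        grep -o "target = '_blank'>[^<]*</a>" "$f" |
            sed -e "s/^target = '_blank'>//" -e 's/<\/a>$//' |
            while IFS= read -r gene; do
                printf '%s\t%s\n' "$(basename "$f")" "$gene"
            done
    done
}
```

It could be run from the directory holding the files as extract_genes . > genes.tsv, which would give one tab-delimited line per gene with the source file name in the first column.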

Thank you to all for the help :).

Attached Files
File Type: txt CM080.txt (17.2 KB)
