Compare 2 flat files

Hi Friends,

I have a flat file with millions of records, and a newer version of the same file. I need to find the records that exist in the old file but not in the new one.

I am now working on this. (I prefer AWK, as it gives good performance.)

Old_file.txt

1 gopi ase ....
2 arun pl ...
3 jack sutha ..
4 peter pm ..
...

New_file.txt

4 peter pm ..
..

Outputfile.txt
2 arun pl ..
..

Code:-

nawk ' NR==FNR{a[$1]=$0 ; next} !a[$1] ' New_file.txt Old_file.txt > Output.txt
Output:-

2 arun pl 

:D:D:D:D
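A minimal, self-contained run of the one-liner above, using small sample files based on the examples in the post (the trailing "...." fields are omitted for brevity):

```shell
# Recreate small versions of the sample files from the post.
printf '%s\n' '1 gopi ase' '2 arun pl' '3 jack sutha' '4 peter pm' > Old_file.txt
printf '%s\n' '4 peter pm' > New_file.txt

# NR==FNR is true only while reading the first file (New_file.txt):
# remember each record keyed by its first field. For the second file
# (Old_file.txt), print only records whose key was not remembered.
awk 'NR==FNR {a[$1]=$0; next} !a[$1]' New_file.txt Old_file.txt > Output.txt

cat Output.txt
```

This prints the three records present in Old_file.txt but absent from New_file.txt.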

Hi, I can see that you prefer AWK, but the small program comm can also be very useful. I don't think it would barf on big files unless there are too many lines between diffs.

For example, this will show you the lines that do not exist in both files:

comm -3 oldfile.txt newfile.txt

So in the case of newfile.txt only having records removed, it would work.
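For instance, with two small sorted files (the contents are illustrative; comm expects its input to be sorted):

```shell
printf '%s\n' 'apple' 'banana' 'cherry' > oldfile.txt
printf '%s\n' 'banana' 'date' > newfile.txt

# -3 suppresses the column of common lines, leaving the lines unique to
# oldfile.txt (flush left) and the lines unique to newfile.txt (tab-indented).
comm -3 oldfile.txt newfile.txt
```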
Best regards,
Lakris

Kindly excuse me, but I didn't know that nawk or gawk would barf on big files!
I only knew that its limitations concern field sizes (columns).
Is that right, guys?

comm is the ideal candidate for doing the job. It is faster than awk.

$comm file_orig file_new

The output consists of three columns: column 1 holds the lines only in file_orig, column 2 the lines only in file_new, and column 3 the lines common to both.

You can suppress a particular column by giving its number as an option to comm.

$comm -12 fileA fileB

suppresses both columns 1 and 2 and gives you the contents common to both files (column 3).

For your job the command would be:

$comm  -23 orig_file new_file

gives the list of records that were deleted from new_file.

Assumptions made:
orig_file : file containing all the records
new_file : subset of orig_file; some records have been deleted from it.

column 1 : contains the records unique to orig_file (not present in new_file)
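Putting the pieces above together on tiny illustrative files:

```shell
printf '%s\n' 'alpha' 'bravo' 'charlie' 'delta' > orig_file
printf '%s\n' 'alpha' 'charlie' > new_file

# Suppress columns 2 and 3, leaving only the records unique to
# orig_file, i.e. the records deleted from new_file.
comm -23 orig_file new_file
```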

Hope this helps.

Cheers,
14341

14341:- I thought the issue was that awk would barf on big files, not that comm executes faster than awk; I know that built-in shell commands are always faster than external ones.

I hope I have cleared up the issue,
and I still don't know whether awk has limitations on record (row) or column sizes.

BR

awk has some limitations, e.g.:

Number of fields per record       100
Characters per input record       3000
Characters per output record      3000
Characters per field              1024
Characters per printf string      3000
Characters in literal string      400
Characters in character class     400
Files open                        15
Pipes open                        1

gawk, mawk, and other newer implementations lift most of these limitations.

Reference: O'Reilly, sed & awk, Chapter 10.8, "Limitations".

just wanted to share this info.

anchal_khare:- Is there a limitation on the number of records?
Can nawk handle, say, 10,000,000 records in one file?

Thanks in advance
BR

Not sure, but I guess it can handle any number of records.
Maybe others can confirm.

comm compares the entire line, while the original post mentions using only the first 3 fields. Perhaps a sort prior to comm would be in order:

sort -u -k 1,3

Although the following AWK may suffice:

awk 'FNR==NR {a[$1$2$3]=$0} FNR!=NR {delete a[$1$2$3]} END {for (i in a) print a[i]}' Old_file.txt New_file.txt
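A quick run of this approach on illustrative data (note that the END loop must print a[i], and for (i in a) emits records in arbitrary order):

```shell
printf '%s\n' '1 gopi ase x1' '2 arun pl x2' '4 peter pm x3' > Old_file.txt
printf '%s\n' '4 peter pm x3' > New_file.txt

# Key on the first three fields; records from New_file.txt delete
# their key, and whatever remains in the map is printed at the end.
awk 'FNR==NR {a[$1$2$3]=$0}
     FNR!=NR {delete a[$1$2$3]}
     END     {for (i in a) print a[i]}' Old_file.txt New_file.txt
```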

Regards,
alister

Hi ahmad.diab,
no, I didn't insinuate that awk and its siblings would barf on big files or big records. I only expressed my uncertainty about whether comm would be sufficient for the job on big files. :slight_smile:

Best regards,
Lakris

Two points:

  1. comm requires the input to be sorted.
  2. awk will be slower compared to comm

You can try:

grep -vf new_file old_file
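If grep is tried this way, the -F (fixed strings) and -x (whole-line match) flags keep the records from being interpreted as regular expressions (sample data is illustrative):

```shell
printf '%s\n' '1 gopi ase' '2 arun pl' '4 peter pm' > old_file
printf '%s\n' '4 peter pm' > new_file

# -v invert match, -x match whole lines, -F fixed strings,
# -f read the patterns from a file
grep -vxFf new_file old_file
```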

I don't think this will scale either:

with more patterns to be searched and more input records, the search is almost O(n²),

and that is not the case with awk, where the power of associative arrays can be used.

Build a map from one file, then check the map as you iterate through the records of the other. The catch is deciding which file to use to build the map, since we don't know the number of records before processing.