Compare 2 flat files

Hi Friends,

I have a flat file with millions of records, and a newer version of the same file. I need to find the records that exist in the old file but not in the new one.

I am now working on this. (I prefer AWK, as it gives good performance.)

Old_file.txt

1 gopi ase ....
2 arun pl ...
3 jack sutha ..
4 peter pm ..
...

New_file.txt

4 peter pm ..
..

Outputfile.txt
2 arun pl ..
..

Code:-

nawk ' NR==FNR{a[$1]=$0 ; next} !a[$1] ' New_file.txt Old_file.txt > Output.txt
Output:-

2 arun pl 

:D:D:D:D
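A minimal, self-contained run of the one-liner above, using small sample files based on the examples in the post (the trailing "...." fields are omitted for brevity):

```shell
# Recreate small versions of the sample files from the post.
printf '%s\n' '1 gopi ase' '2 arun pl' '3 jack sutha' '4 peter pm' > Old_file.txt
printf '%s\n' '4 peter pm' > New_file.txt

# NR==FNR is true only while reading the first file (New_file.txt):
# remember each record keyed by its first field. For the second file
# (Old_file.txt), print only records whose key was not remembered.
awk 'NR==FNR {a[$1]=$0; next} !a[$1]' New_file.txt Old_file.txt > Output.txt

cat Output.txt
```

This prints the three records present in Old_file.txt but absent from New_file.txt.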

Hi, I can see that you prefer AWK, but the small program comm can also be very useful. I don't think it would barf on big files unless there are too many lines between diffs.

For example, this will show you the lines that do not exist in both files:

comm -3 oldfile.txt newfile.txt

So in the case of newfile.txt only having records removed, it would work.
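For instance, with two small sorted files (the contents are illustrative; comm expects its input to be sorted):

```shell
printf '%s\n' 'apple' 'banana' 'cherry' > oldfile.txt
printf '%s\n' 'banana' 'date' > newfile.txt

# -3 suppresses the column of common lines, leaving the lines unique to
# oldfile.txt (flush left) and the lines unique to newfile.txt (tab-indented).
comm -3 oldfile.txt newfile.txt
```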
Best regards,
Lakris

Kindly excuse me, but I didn't know that nawk or gawk would barf on big files!
I only knew that its limitations concern field sizes (columns).
Is that right, guys?

comm is the ideal candidate for doing the job. It is faster than awk.

$comm file_orig file_new

The output consists of three columns: column 1 holds the lines only in file_orig, column 2 the lines only in file_new, and column 3 the lines common to both.

You can suppress a particular column by giving its number as an option to comm.

$comm -12 fileA fileB

suppresses both columns 1 and 2 and gives you the contents common to both files (column 3).

For your job the command would be:

$comm  -23 orig_file new_file

gives the list of records that were deleted from new_file.

Assumptions made:
orig_file : file containing all the records
new_file : subset of orig_file; some records have been deleted from it.

column 1 : contains the records unique to orig_file (not present in new_file)
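Putting the pieces above together on tiny illustrative files:

```shell
printf '%s\n' 'alpha' 'bravo' 'charlie' 'delta' > orig_file
printf '%s\n' 'alpha' 'charlie' > new_file

# Suppress columns 2 and 3, leaving only the records unique to
# orig_file, i.e. the records deleted from new_file.
comm -23 orig_file new_file
```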

Hope this helps.

Cheers,
14341

14341:- I thought the issue was that awk would barf on big files, not that comm executes faster than awk; I know that built-in shell commands are always faster than external ones.

I hope I have cleared up the issue,
and I still don't know whether awk has limitations on record (row) or column sizes.

BR

awk has some limitations, e.g.:

Number of fields per record       100
Characters per input record       3000
Characters per output record      3000
Characters per field              1024
Characters per printf string      3000
Characters in literal string      400
Characters in character class     400
Files open                        15
Pipes open                        1

gawk, mawk, and other newer implementations lift most of these limitations.

Reference: O'Reilly, sed & awk, Chapter 10.8, "Limitations".

just wanted to share this info.

anchal_khare:- Is there a limitation on the number of records?
Can nawk handle, say, 10,000,000 records in one file?

Thanks in advance
BR

Not sure, but I guess it can handle any number of records.
Maybe others can confirm.

comm compares the entire line, while the original post mentions using only the first 3 fields. Perhaps a sort prior to comm would be in order:

sort -u -k 1,3

Although the following AWK may suffice:

awk 'FNR==NR {a[$1$2$3]=$0} FNR!=NR {delete a[$1$2$3]} END {for (i in a) print a[i]}' Old_file.txt New_file.txt
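A quick run of this approach on illustrative data (note that the END loop must print a[i], and for (i in a) emits records in arbitrary order):

```shell
printf '%s\n' '1 gopi ase x1' '2 arun pl x2' '4 peter pm x3' > Old_file.txt
printf '%s\n' '4 peter pm x3' > New_file.txt

# Key on the first three fields; records from New_file.txt delete
# their key, and whatever remains in the map is printed at the end.
awk 'FNR==NR {a[$1$2$3]=$0}
     FNR!=NR {delete a[$1$2$3]}
     END     {for (i in a) print a[i]}' Old_file.txt New_file.txt
```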

Regards,
alister

Hi ahmad.diab,
no, I didn't insinuate that awk and its siblings would barf on big files or big records. I only expressed my uncertainty about whether comm would be sufficient for the job on big files. :slight_smile:

Best regards,
Lakris

Two points:

  1. comm requires the input to be sorted.
  2. awk will be slower compared to comm

You can try:

grep -vf new_file old_file
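If grep is tried this way, the -F (fixed strings) and -x (whole-line match) flags keep the records from being interpreted as regular expressions (sample data is illustrative):

```shell
printf '%s\n' '1 gopi ase' '2 arun pl' '4 peter pm' > old_file
printf '%s\n' '4 peter pm' > new_file

# -v invert match, -x match whole lines, -F fixed strings,
# -f read the patterns from a file
grep -vxFf new_file old_file
```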

I don't think this will scale either:

with more patterns to be searched and more input records, the search is almost O(n²),

and that is not the case with awk, where the power of associative arrays can be used.

Build a map from one file, then check the map as you iterate through the records of the other. The catch is deciding which file to use to build the map, since we don't know the number of records before processing.