This post introduces you how to compare and combine datasets in SAS and SPSS (syntax).
Comparing Data-Sets in SAS
Comparing between two or more data sets can be done in SAS with the COMPARE procedure. The procedure can be used for all types of variables. In addition, the output of this comparison can be saved in a new (SAS) data-set.
First, we will create two similar data-sets: dataset1 and dataset2.
data work.dataset1; input character$3. number; datalines; abc 123 bcd 234 cde 345 ; run; data work.dataset2; input character$3.number; datalines; abd 123 bcd 234 cde 346 ; run;
The data-sets are similar but not identical – the name of ID 1 is different, as well as the ID 3’s number.
Now we can use the COMPARE procedure to compare between the two data-sets.
proc compare base=work.dataset1 compare=work.dataset2 out=work.combo outbase outdif allvar; run;
Base is our master file and after ‘compare’ we should specify the other file(s).
The comparison here was conducted for all the variables in the data files. Comparison along some of the variables included in the files can be done with the option of ‘BY= ‘. The data-sets should be sorted along these variables, with the SORT procedure, before running the COMPARE procedure.
With the option of ‘OUT’ a new data file will be created from the output of this procedure and with the option of ‘OUTDIF’ the matches and differences between the two data-sets will be specified in our (output) new file. Matches will have the value ‘.’ for a character variable and ‘0’ for a numeric variable. Differences will be indicated with the value ‘X’ for a character and the numeric difference will be specified for a numeric variable. All will be specified in our new output data file under under the column ‘Type of Observation’ as ‘DIF’.
Specification of the observations in the base data set or in the transaction files can be done with the options of ‘OUTBASE’ ‘OUTCOMP’ respectively. These will be indicated under the column ‘Type of Observation’ as ‘BASE’ ‘COMPARE’ respectively.
Comparing and Combining Data-Sets Using SPSS (Syntax)
The UPDATE command replaces values in a master file with other values recorded in one or more other files. It compares records in the master file and transaction file(s) according to a key variable and updates the master file or creates a new file which combines all unique records.
Here we created similar but not identical data-sets: dataset1 and dataset2.
data list list / id * name (A3) number. begin data 1 "abc" 123 2 "bcd" 234 3 "cde" 345 end data. save outfile "Macintosh HD:Users:Documents\dataset1.sav".
data list list / id * name (A3) number. begin data 1 "abd" 123 2 "bcd" 234 3 "cde" 344 end data. save outfile "Macintosh HD:Users:Documents\dataset2.sav".
Now we are going to compare between the two, while dataset1 is the master file and dataset2 is a transaction file.
update file = "Macintosh HD:Users:Documents\dataset1.sav" /in = flag /file = "Macintosh HD:Users:Documents\dataset2.sav" /by all /keep=all.
The subcommand IN will create a new variable (‘flag’) in the output file that indicates whether the records in both (or more) files are identical (indicates as ‘0’), or different (indicates as ‘1’). Non-matching records will also be listed. In other words, duplicate non matching records would be added, with value ‘1’ to those included in the master file and the value of ‘0’ to those recorded in the transaction file. For this reason, we would recommend to save the output in a separate new data file, with SAVE OUTFILE command.
BY indicates which variable(s) to compare between files. The variables can be numeric or string, as is illustrated in the current example. It can also be done along all the variables included in the file, here we will use the BY ALL.
As we may create a new data file (which I encourage to do), we can also specify our preferences with the subcommand KEEP or DROP (of variables).
Questions? Comments? Drop me a line.. email@example.com