SAS程序猿/媛在处理数据的时候,经常会遇到要处理有关重复记录的问题,其中有些重复记录是我们需要的,而有的则是多余的。如果是多余的直接去重:
- PROC SORT,其中有两个选项NODUPKEY、NODUPRECS(NODUP),第一个是按照BY变量来去重,第二是比较整条记录来去重,重复的记录可以用DUPOUT=来保留。程序如下:
proc sort data=sashelp.class out=unq nodupkey dupout=dup; by WEIGHT; run;
- HASH,程序如下:
data _null_; if 0 then set sashelp.class; if _n_=1 then do; declare hash h(dataset: 'sashelp.class', ordered: 'y'); h.definekey('WEIGHT'); h.definedata(all:'y'); h.definedone(); end; h.output(dataset: 'uni'); stop; run;
如果重复记录是需要保留以备后用则可以用下面几种方法:
- DATA步,程序如下:
proc sort data=sashelp.class out=class; by WEIGHT; run; data uni dup; set class; by WEIGHT; if first.WEIGHT and last.WEIGHT then output uni; else output dup; run;
- PROC SQL,程序如下:
proc sql; create table uni as select * from sashelp.class group by WEIGHT having count(*) = 1 ; create table dup as select * from sashelp.class group by WEIGHT having count(*) > 1 ; quit;
- HASH,程序(SAS9.2+)如下:
data uni(drop=rc: i); if _n_=1 then do; if 0 then set sashelp.class; dcl hash h1(dataset: 'sashelp.class', multidata:'y'); h1.definekey('WEIGHT'); h1.definedata(all: 'yes'); h1.definedone(); dcl hash h2(dataset: 'sashelp.class'); dcl hiter hi('h2'); h2.definekey('WEIGHT'); h2.definedone(); end; rc1=hi.first(); do while(rc1=0); rc2= h1.find(); i=0; do while(rc2=0 and i < 2); i+1; rc2=h1.find_next(); end; if i < 2 then do; output; if i < 2 then h1.remove(); end; rc1=hi.next(); end; h1.output(dataset: 'dup'); run;
不管是去重还是保留重复的记录,上面几种方法中HASH行数都是最多的,但是这种方法在去重之前不用排序,故当处理的数据集较大时建议使用此方法以提高效率。