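Today I came across an old post asking how to count the occurrences and frequency of each letter in an article. Let's start with the related problem of counting words. The most direct approach is to split the article into rows, one word per row, and then use PROC FREQ to get the counts and percentages. The program is as follows: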
data;
TEXT="It is Teacher's Day today. On this special occasion I would like to extend my heartfelt congratulations to all teachers, Happy Teacher's Day! Of all teachers who have taught me since my early childhood, the most unforgettable one is my first English teacher in college, Ms. Zhang. It is she who has aroused my keen interest in the learning of English and helped me realize the importance of self-reliance. Born into a poor farmer's family in a mountainous area and educated in relatively primitive surroundings, I found myself lagging far behind in the first class in college, which happened to be Ms. Zhang's English class. I was really discouraged and frustrated, so I decided to drop out. Ms. Zhang was so keenly insightful that she had noticed my embarrassment in class. After class, she called me into the Teacher's Room and discussed the situation with me, earnestly and kindly, citing the example of Robinson Crusoe to motivate me to go ahead in spite of all kinds of difficulties. Be a man and rely on yourself, she nudged me. The next time we met, she brought me a simplified version of Robinson Crusoe and recommended that I finish reading it in a week and write a book report. Under her consistent and patient guidance, not only has my English been greatly improved, but my confidence and courage enhanced considerably. Rely on yourself and be a man, Ms. Zhang's inspiring words have been echoing in my mind. I will work harder and try my utmost to lay a solid foundation for my future career. Only by so doing can I repay Ms. Zhang's kindness and live up to her expectations of me, that is, to become a useful person and contribute to society.";
i=1;
do until(scan(TEXT, i)='');
WORD=scan(TEXT, i);
output;
i+1;
end;
run;
proc freq;
tables WORD / noprint out=counts;
run;
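The result is as follows:
The approach above can also be used to count letter frequencies, but it is rather crude: for a long article, the number of rows becomes very large. Below is an approach that uses CALL PRXNEXT instead: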
data demo;
TEXT="It is Teacher's Day today. On this special occasion I would like to extend my heartfelt congratulations to all teachers, Happy Teacher's Day! Of all teachers who have taught me since my early childhood, the most unforgettable one is my first English teacher in college, Ms. Zhang. It is she who has aroused my keen interest in the learning of English and helped me realize the importance of self-reliance. Born into a poor farmer's family in a mountainous area and educated in relatively primitive surroundings, I found myself lagging far behind in the first class in college, which happened to be Ms. Zhang's English class. I was really discouraged and frustrated, so I decided to drop out. Ms. Zhang was so keenly insightful that she had noticed my embarrassment in class. After class, she called me into the Teacher's Room and discussed the situation with me, earnestly and kindly, citing the example of Robinson Crusoe to motivate me to go ahead in spite of all kinds of difficulties. Be a man and rely on yourself, she nudged me. The next time we met, she brought me a simplified version of Robinson Crusoe and recommended that I finish reading it in a week and write a book report. Under her consistent and patient guidance, not only has my English been greatly improved, but my confidence and courage enhanced considerably. Rely on yourself and be a man, Ms. Zhang's inspiring words have been echoing in my mind. I will work harder and try my utmost to lay a solid foundation for my future career. Only by so doing can I repay Ms. Zhang's kindness and live up to her expectations of me, that is, to become a useful person and contribute to society.";
TEXT_TEMP=TEXT;
if _N_=1 then do;
RE1=prxparse('s/(\b.+?\b)(\s.*?)(\b\1+\b)/\2\3/i');
RE2=prxparse('/(\b.+?\b)(\s.*?)(\b\1+\b)/i');
end;
/*Remove repeated values*/
do i=1 to 1000;
TEXT=prxchange(RE1, -1, cats(TEXT));
if not prxmatch(RE2, cats(TEXT)) then leave;
end;
do i=1 to countw(TEXT);
WORD=scan(TEXT, i);
COUNT=0;
RE=prxparse('/\b'||cats(WORD)||'\b/i');
START=1;
STOP=length(TEXT_TEMP);
call prxnext(RE, START, STOP, TEXT_TEMP, POSITION, LENGTH);
do while(POSITION>0);
COUNT+1;
call prxnext(RE, START, STOP, TEXT_TEMP, POSITION, LENGTH);
end;
call prxfree(RE); /*Release the compiled pattern before the next word*/
FREQ=COUNT/countw(TEXT_TEMP)*100;
keep WORD COUNT FREQ;
output;
end;
run;
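The regex deduplication loop is the slow part of the program above. As a rough sketch of an alternative (not the original author's code), every word can be counted as it is scanned and the duplicate rows dropped afterwards with PROC SORT and NODUPKEY. The short sample sentence, the dataset name `demo_all`, and the variables `TOTAL` and `LEN` are assumptions made for illustration only; words are lowercased so that rows differing only in case collapse into one.

```sas
/* Sketch: count every scanned word, then deduplicate with PROC SORT. */
/* The sample text and the names demo_all, TOTAL, LEN are hypothetical. */
data demo_all;
    length TEXT $200 WORD $40;
    TEXT = "Be a man and rely on yourself. Rely on yourself and be a man.";
    TOTAL = countw(TEXT);
    do i = 1 to TOTAL;
        WORD = lowcase(scan(TEXT, i));            /* fold case for the sort key */
        RE = prxparse('/\b' || cats(WORD) || '\b/i');
        COUNT = 0;
        START = 1;
        STOP = length(TEXT);
        call prxnext(RE, START, STOP, TEXT, POSITION, LEN);
        do while (POSITION > 0);                  /* one iteration per match */
            COUNT + 1;
            call prxnext(RE, START, STOP, TEXT, POSITION, LEN);
        end;
        call prxfree(RE);                         /* release the compiled pattern */
        FREQ = COUNT / TOTAL * 100;
        output;
    end;
    keep WORD COUNT FREQ;
run;

/* Keep one row per distinct word */
proc sort data=demo_all nodupkey;
    by WORD;
run;
```

This trades the repeated PRXCHANGE passes for some redundant counting, since a word that occurs several times is counted once per occurrence before the sort discards the extra rows. Whether that is faster in practice depends on the text and would need to be measured.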
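Note that the first approach is case-sensitive: for example, it counts 'Be' and 'be' separately (see the figure below).
Of course, the case can be normalized before running PROC FREQ. The second approach uses a regular expression to remove duplicate words, so it is somewhat slow; alternatively, the duplicates can be removed at the end with PROC SORT. The second approach can also be used to count letters. The program is as follows: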
data demo;
TEXT="It is Teacher's Day today. On this special occasion I would like to extend my heartfelt congratulations to all teachers, Happy Teacher's Day! Of all teachers who have taught me since my early childhood, the most unforgettable one is my first English teacher in college, Ms. Zhang. It is she who has aroused my keen interest in the learning of English and helped me realize the importance of self-reliance. Born into a poor farmer's family in a mountainous area and educated in relatively primitive surroundings, I found myself lagging far behind in the first class in college, which happened to be Ms. Zhang's English class. I was really discouraged and frustrated, so I decided to drop out. Ms. Zhang was so keenly insightful that she had noticed my embarrassment in class. After class, she called me into the Teacher's Room and discussed the situation with me, earnestly and kindly, citing the example of Robinson Crusoe to motivate me to go ahead in spite of all kinds of difficulties. Be a man and rely on yourself, she nudged me. The next time we met, she brought me a simplified version of Robinson Crusoe and recommended that I finish reading it in a week and write a book report. Under her consistent and patient guidance, not only has my English been greatly improved, but my confidence and courage enhanced considerably. Rely on yourself and be a man, Ms. Zhang's inspiring words have been echoing in my mind. I will work harder and try my utmost to lay a solid foundation for my future career. Only by so doing can I repay Ms. Zhang's kindness and live up to her expectations of me, that is, to become a useful person and contribute to society.";
do i=1 to 26;
CHAR=byte(i+64);
COUNT=0;
RE=prxparse('/'||CHAR||'/i');
START=1;
STOP=length(TEXT);
call prxnext(RE, START, STOP, TEXT, POSITION, LENGTH);
do while(POSITION>0);
COUNT+1;
call prxnext(RE, START, STOP, TEXT, POSITION, LENGTH);
end;
call prxfree(RE); /*Release the compiled pattern before the next letter*/
if COUNT>0 then do;
FREQ=COUNT/length(compress(prxchange('s/\W//', -1, TEXT)))*100;
output;
end;
end;
keep CHAR COUNT FREQ;
run;
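The result is as follows:
Of course, SAS also has a built-in function, COUNTC, that can count letter frequencies directly. The program is as follows: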
data demo;
TEXT="It is Teacher's Day today. On this special occasion I would like to extend my heartfelt congratulations to all teachers, Happy Teacher's Day! Of all teachers who have taught me since my early childhood, the most unforgettable one is my first English teacher in college, Ms. Zhang. It is she who has aroused my keen interest in the learning of English and helped me realize the importance of self-reliance. Born into a poor farmer's family in a mountainous area and educated in relatively primitive surroundings, I found myself lagging far behind in the first class in college, which happened to be Ms. Zhang's English class. I was really discouraged and frustrated, so I decided to drop out. Ms. Zhang was so keenly insightful that she had noticed my embarrassment in class. After class, she called me into the Teacher's Room and discussed the situation with me, earnestly and kindly, citing the example of Robinson Crusoe to motivate me to go ahead in spite of all kinds of difficulties. Be a man and rely on yourself, she nudged me. The next time we met, she brought me a simplified version of Robinson Crusoe and recommended that I finish reading it in a week and write a book report. Under her consistent and patient guidance, not only has my English been greatly improved, but my confidence and courage enhanced considerably. Rely on yourself and be a man, Ms. Zhang's inspiring words have been echoing in my mind. I will work harder and try my utmost to lay a solid foundation for my future career. Only by so doing can I repay Ms. Zhang's kindness and live up to her expectations of me, that is, to become a useful person and contribute to society.";
do i=1 to 26;
CHAR=byte(i+64);
COUNT=countc(TEXT, CHAR, 'i');
if COUNT>0 then do;
FREQ=COUNT/length(compress(prxchange('s/\W//', -1, TEXT)))*100;
output;
end;
end;
keep CHAR COUNT FREQ;
run;
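Returning to the case-sensitivity issue noted for the first approach, one possible fix is to fold every word to lowercase before PROC FREQ sees it. The following is a minimal sketch under stated assumptions: the short sample sentence and the dataset name `words` are hypothetical stand-ins, and the full essay can be substituted for `TEXT` (with a longer LENGTH).

```sas
/* Sketch: case-insensitive word counts with LOWCASE + PROC FREQ. */
/* The sample text and the dataset name "words" are hypothetical. */
data words;
    length TEXT $200 WORD $40;
    TEXT = "Be a man and rely on yourself. Rely on yourself and be a man.";
    do i = 1 to countw(TEXT);
        WORD = lowcase(scan(TEXT, i));   /* 'Be' and 'be' become the same value */
        output;
    end;
    keep WORD;
run;

proc freq data=words;
    tables WORD / noprint out=counts;    /* COUNT and PERCENT per distinct word */
run;
```

With this sample sentence, each of the seven distinct words should end up with COUNT=2 in `counts`, instead of 'Be'/'be' and 'Rely'/'rely' being reported as separate rows.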