Bash Programming: Counting the Frequency of Each Word in a Text File

JustYY.com: Xiaolaizi's UK Life and News

11 years ago

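The shell on Linux is a powerful programming tool (and environment). Here is an example, taken from a problem on LeetCode.

Write a bash script to calculate the frequency of each word in a text file, words.txt.

For simplicity, you may assume:

words.txt contains only lowercase characters and space ' ' characters.
Each word consists of lowercase characters only.
Words are separated by one or more whitespace characters.
For example, assume that words.txt has the following content: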

the day is sunny the the
the sunny is is
Your script should output the following, sorted by descending frequency:
the 4
is 3
sunny 2
day 1
Note:
Don't worry about handling ties; it is guaranteed that each word's frequency count is unique.
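
To try the commands in this post, the sample input can be recreated first. This is only a convenience sketch; it assumes you are happy to write words.txt into the current directory.

```shell
# Recreate the two-line sample input from the problem statement.
printf '%s\n' 'the day is sunny the the' 'the sunny is is' > words.txt

# Show what was written.
cat words.txt
```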

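You could of course write a script of a few lines entirely in bash, but chaining several commands together with pipes is enough to solve the problem in a single line.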
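
For comparison, here is roughly what the "few lines of bash" version might look like. It is a sketch rather than a reference solution: it assumes bash 4 or later (for associative arrays), the function name word_freq is made up for illustration, and it still hands the final ordering to sort.

```shell
#!/usr/bin/env bash
# word_freq FILE: count words using a bash associative array.
# Assumes bash >= 4; word_freq is a hypothetical helper name.
word_freq() {
  local file=$1 word
  local -a words
  local -A counts=()

  # read -a splits each line on runs of whitespace into an array.
  # The || clause keeps a final line that lacks a trailing newline.
  while read -r -a words || (( ${#words[@]} )); do
    for word in "${words[@]}"; do
      counts[$word]=$(( ${counts[$word]:-0} + 1 ))
    done
  done < "$file"

  # Print "word count" pairs, then sort by the count, descending.
  for word in "${!counts[@]}"; do
    printf '%s %d\n' "$word" "${counts[$word]}"
  done | sort -k2,2nr
}
```

Calling word_freq words.txt prints one "word count" pair per line. The one-liners that follow get the same result with far less code, which is the point of the post.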

Solution – cat, tr, awk, sort

cat words.txt | tr -s ' ' '\n' | awk '{nums[$1]++}END{for(word in nums) print word, nums[word]}' | sort -rn -k2

Solution – grep, sort, uniq, sort, awk

grep -oE '[a-z]+' words.txt | sort | uniq -c | sort -nr | awk '{print $2" "$1}'

Solution – sed, grep, sort, uniq, sort, awk

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort | uniq -c | sort -nr | awk '{print $2" "$1}'

Solution – awk, sort

awk '{words[$1]+=1} END{for(word in words){print word,words[word]}}' RS="[ \n]+" words.txt  | sort -nrk2

Solution – cat, awk

cat words.txt | awk '{for(i=1;i<=NF;++i) { arr[$i]++; } } END { x=0; for(var in arr) {newarr[arr[var]]=var; if(arr[var]>x) x=arr[var];} for(i=x;i>0;--i) if (newarr[i] > 0) print newarr[i] " "i; }'
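
That awk one-liner is dense, so here is the same idea spread over several lines with comments. It is a readability sketch, not the author's exact code: the variable names are changed, the membership test uses awk's `in` operator, and the wrapper function freq_awk is a made-up name.

```shell
# freq_awk FILE: count and order words using awk alone.
# freq_awk is a hypothetical helper name used for illustration.
freq_awk() {
  awk '
  {
    # Count every whitespace-separated field on the line.
    for (i = 1; i <= NF; ++i) count[$i]++
  }
  END {
    max = 0
    # Invert the table: frequency -> word. This only works because
    # the problem guarantees that no two words share a frequency.
    for (w in count) {
      by_freq[count[w]] = w
      if (count[w] > max) max = count[w]
    }
    # Walk down from the highest frequency, skipping unused counts.
    for (n = max; n > 0; --n)
      if (n in by_freq) print by_freq[n], n
  }' "$1"
}
```

Because the ordering is done inside awk, no external sort is needed. If two words could share a count, one of them would overwrite the other in by_freq, so this approach depends on the uniqueness guarantee in the problem statement.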

Solution – tr, sort, uniq, sort, awk

tr -s ' ' '\n' < words.txt|sort|uniq -c|sort -nr|awk '{print $2, $1}'

Solution – cat, tr, sort, uniq, sort, sed

cat words.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -nr | sed -r -e 's/[[:space:]]*([[:digit:]]+)[[:space:]]*([[:alpha:]]+)/\2 \1/g'

There is a well-known saying on the Linux command line: Where there is a shell, there is a way.


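Breaking Down the Commands

The solutions above are all quite similar. The most important first step is to split the file into individual words: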

sed -r 's/\s+/\n/g' words.txt

Or:

cat words.txt | tr -s ' ' '\n'

Or:

grep -oE '[a-z]+' words.txt

Each of these commands prints the words one per line:

the
day
is
sunny
the
the
the
sunny
is
is
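
A quick way to convince yourself that the three splitters agree is to compare their output on the sample. This is a sketch that assumes GNU sed (for -r, \s and \n in the replacement) and a grep that supports -o; the split_* function names are made up for illustration.

```shell
# Three ways to put one word per line (hypothetical helper names).
split_sed()  { sed -r 's/\s+/\n/g' "$1"; }
split_tr()   { tr -s ' ' '\n' < "$1"; }
split_grep() { grep -oE '[a-z]+' "$1"; }

sample=$(mktemp)
printf '%s\n' 'the day is sunny the the' 'the sunny is is' > "$sample"

# With single spaces and no leading or trailing blanks, all three agree.
if [ "$(split_sed "$sample")" = "$(split_tr "$sample")" ] &&
   [ "$(split_tr "$sample")" = "$(split_grep "$sample")" ]; then
  echo "all three produce the same word list"
fi
rm -f "$sample"
```

On messier input they can differ: a line with a leading or trailing space leaves an empty line in the sed output, whereas grep -o only ever prints matches. That is why the sed variant is followed by a step that drops empty lines.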

Next, grep -v "^$" (-v inverts the match) removes any empty lines. Sorting then places identical words next to each other.

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort

Output:

day
is
is
is
sunny
sunny
the
the
the
the

The uniq -c command then shows how many times each word occurs:

      1 day
      3 is
      2 sunny
      4 the
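
The sort before uniq matters because uniq only collapses adjacent duplicate lines. A small sketch with a made-up three-word input shows the difference:

```shell
# Without sorting, the two "the" lines are not adjacent,
# so uniq -c reports them as separate runs (three output lines).
printf '%s\n' the day the | uniq -c

# After sorting, identical words sit together and are counted once
# each (two output lines, with a count of 2 for "the").
printf '%s\n' the day the | sort | uniq -c
```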

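Now pipe the result into a second sort, this time numeric and in reverse order (sort -nr), so that the highest counts come first. Reversing the earlier sort would not help, because that one orders the words rather than the counts.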

      4 the
      3 is
      2 sunny
      1 day
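
It is worth using a numeric sort here rather than a plain reversed one. A sketch with made-up counts of 9 and 10 shows why:

```shell
# Lexical comparison looks at characters, so "9" beats "1":
# the line "9 is" comes out first.
printf '%s\n' '9 is' '10 the' | LC_ALL=C sort -r

# Numeric comparison reads the leading number: "10 the" comes first.
printf '%s\n' '9 is' '10 the' | sort -nr
```

uniq -c right-aligns its counts, which can make a plain sort -r appear to work, but that relies on the padding and on the locale's collation rules. Asking for a numeric sort avoids depending on either.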

Finally, pipe the result into awk, which splits each line into columns on whitespace, and print the two columns in swapped order.

awk '{print $2" "$1}'

Output:

the 4
is 3
sunny 2
day 1
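
Putting the steps together, the whole pipeline can be wrapped in a small function with one comment per stage. This is a sketch; the name word_freq_pipeline is made up for illustration, and it uses the tr splitter, though either of the other two would do.

```shell
# word_freq_pipeline FILE: print "word count" pairs, most frequent first.
# word_freq_pipeline is a hypothetical helper name.
word_freq_pipeline() {
  tr -s ' ' '\n' < "$1" |   # one word per line
    grep -v '^$' |          # drop empty lines
    sort |                  # group identical words together
    uniq -c |               # prefix each word with its count
    sort -nr |              # highest count first
    awk '{print $2, $1}'    # swap columns to "word count"
}
```

Running word_freq_pipeline words.txt on the sample file gives the same four lines as above.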


English version: Shell Coding Exercise: Word Frequency
