
The Powerful Linux Bash Shell


The shell on Linux is a very powerful programming tool (and environment). Here is an example: the LeetCode coding site has the following problem.

Write a bash script to calculate the frequency of each word in a text file words.txt.

For simplicity's sake, you may assume that:

words.txt contains only lowercase characters and space ' ' characters.
Each word consists of lowercase characters only.
Words are separated by one or more whitespace characters.
For example, assume that words.txt has the following content:

the day is sunny the the
the sunny is is
Your script should output the following, sorted in descending order of frequency:
the 4
is 3
sunny 2
day 1
Note:
Don't worry about handling ties; it is guaranteed that each word's frequency count is unique.

Submit your solution on LeetCode: https://leetcode.com/problems/word-frequency/

Of course you could write a several-line script entirely in Bash, but by chaining multiple commands together with pipes, the problem can in fact be solved in a single line.
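For comparison, here is a rough sketch of such a multi-line script, written almost entirely in Bash (assuming Bash 4 or later for declare -A associative arrays; the final sort is still an external command):

#!/usr/bin/env bash
# Count word frequencies with a Bash 4+ associative array.
declare -A count
while read -r -a words; do           # split each line into an array of words
  for w in "${words[@]}"; do
    (( count[$w]++ ))                # tally each word
  done
done < words.txt
for w in "${!count[@]}"; do          # print "word count" pairs
  printf '%s %d\n' "$w" "${count[$w]}"
done | sort -k2 -nr                  # order by the count column, descending

The one-liners below are much shorter.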

Solution – cat, tr, awk, sort

cat words.txt | tr -s ' ' '\n' | awk '{nums[$1]++}END{for(word in nums) print word, nums[word]}' | sort -rn -k2

Solution – grep, sort, uniq, sort, awk

grep -oE '[a-z]+' words.txt | sort | uniq -c | sort -r | awk '{print $2" "$1}'

Solution – sed, grep, sort, uniq, sort, awk

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort | uniq -c | sort -r | awk '{print $2" "$1}'

Solution – awk and sort

awk '{words[$1]+=1} END{for(word in words){print word,words[word]}}' RS="[ \n]+" words.txt | sort -nrk2

Solution – cat and awk

cat words.txt | awk '{for(i=1;i<=NF;++i) { arr[$i]++; } } END { x=0; for(var in arr) {newarr[arr[var]]=var; if(arr[var]>x) x=arr[var];} for(i=x;i>0;--i) if (newarr[i] > 0) print newarr[i] " "i; }'
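This last program is the least obvious of the one-liners. Expanded for readability, it reads as follows (the same logic, reading words.txt directly instead of through cat, and with the clearer membership test i in newarr in place of the original newarr[i] > 0 string comparison; it relies on the problem's guarantee that all frequency counts are unique):

awk '
{ for (i = 1; i <= NF; ++i) arr[$i]++ }    # count every word on every line
END {
  x = 0
  for (var in arr) {                       # invert the map: count -> word
    newarr[arr[var]] = var
    if (arr[var] > x) x = arr[var]         # remember the highest count
  }
  for (i = x; i > 0; --i)                  # walk the counts from high to low
    if (i in newarr) print newarr[i] " " i
}' words.txt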

Solution – tr, sort, uniq, sort, awk

tr -s ' ' '\n' < words.txt|sort|uniq -c|sort -nr|awk '{print $2, $1}'

Solution – sed

cat words.txt | tr -s '[[:space:]]' '\n'| sort | uniq -c | sort -r | sed -r -e 's/[[:space:]]*([[:digit:]]+)[[:space:]]*([[:alpha:]]+)/\2 \1/g'

There is a well-known saying on the Linux command line: Where there is a shell, there is a way.

(Image: fork-bomb)

Breaking Down the Commands

The solutions above are all fairly similar. The most important first step is to split the file into individual words:

sed -r 's/\s+/\n/g' words.txt

or

cat words.txt | tr -s ' ' '\n'

or

grep -oE '[a-z]+' words.txt

All of these commands print the words, one per line:

the
day
is
sunny
the
the
the
sunny
is
is

Then we can remove the empty lines with grep -v "^$" (-v inverts the match), and sort the output so that identical words are grouped together:

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort

Output:

day
is
is
is
sunny
sunny
the
the
the
the

The uniq -c command then prefixes each distinct word with the number of times it appears.
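Continuing the pipeline from the previous step:

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort | uniq -c

Output: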

      1 day
      3 is
      2 sunny
      4 the

We can then add one more sort stage to the pipeline, passing it the -r flag so that the counts come out in reverse (descending) order.
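For example, using sort -nr (the -n flag compares the counts numerically, which is more robust than a plain reverse text sort):

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort | uniq -c | sort -nr

Output: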

      4 the
      3 is
      2 sunny
      1 day

Finally, we just pipe the result into awk, which splits each line into whitespace-separated columns, and print the columns in the required order (word first, then count):

awk '{print $2" "$1}'

Output:

the 4
is 3
sunny 2
day 1
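Putting all the stages together gives the complete one-liner, which is essentially the sed-based solution listed above with a numeric sort:

sed -r 's/\s+/\n/g' words.txt | grep -v "^$" | sort | uniq -c | sort -nr | awk '{print $2" "$1}'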

Original English version: https://helloacm.com/shell-coding-exercise-word-frequency/
