
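Usage: python [script name] [difficulty] [input file name]    Example: python 随机挖空文本.py 1 input.md
fileName: the text file to process; defaults to input.md
difficulty: takes the values 0, 1, 2, 3 (default 0), which correspond to a blanking ratio inter of 25%, 50%, 75%, 100%
sep: the separators; the text is split into segments at these characters, and the code below uses common full-width punctuation marks as an example
exc: the exclusion pattern; lines that match this regular expression are ignored
res: the reserved pattern; when a segment is blanked out, the prefix at its start that matches this regular expression is kept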
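
As a rough illustration of how the parameters described above behave, the sketch below evaluates them on a sample line. It assumes the separators are the full-width marks mentioned in the notes; the helper names `ratio`, `is_excluded` and `keeps_prefix` are hypothetical and exist only for this example.

```python
import re

sep = '[,。;!?、]'  # assumed full-width separators
exc = '[#]'               # exclusion pattern, matched at the start of a line
res = '[0-9+-][.]'        # reserved pattern, matched at the start of a segment


def ratio(difficulty):
    """Map difficulty 0..3 to the blanking ratio 0.25..1.0 (hypothetical helper)."""
    return (int(difficulty) + 1) * 0.25


def is_excluded(line):
    """True if the line starts with a match of exc (hypothetical helper)."""
    return re.match(exc, line) is not None


def keeps_prefix(segment):
    """True if the first three characters of the segment would be kept (hypothetical helper)."""
    return len(segment) > 2 and re.match(res, segment[:3]) is not None


segments = re.split(sep, '1. 春眠不觉晓,处处闻啼鸟。')
print(segments)                   # ['1. 春眠不觉晓', '处处闻啼鸟', '']
print(ratio(1))                   # 0.5
print(is_excluded('# 标题'))       # True
print(keeps_prefix(segments[0]))  # True
```

Note that the split leaves an empty trailing segment when a line ends with a separator, and that the kept prefix is always the first three characters of the segment, so a prefix such as "1. " is preserved whole only when it is followed by a space.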
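import sys
import random
import re

fileName = 'input.md'
difficulty = 0
if len(sys.argv) > 1:
    difficulty = sys.argv[1]
if len(sys.argv) > 2:
    fileName = sys.argv[2]
# difficulty sets the share of segments to blank out: 0: 25%, 1: 50%, 2: 75%, 3: 100%
inter = (int(difficulty) + 1) * 0.25
sep = '[,。;!?、]'  # full-width separators used to split a line into segments
exc = '[#]'  # lines that start with a match (Markdown headings) are copied unchanged
res = '[0-9+-][.]'  # prefix such as "1." that is kept in front of a blank

with open(fileName, 'r', encoding='utf-8') as content, \
        open('output.md', 'w', encoding='utf-8') as f:
    for line in content:
        # Set the line ending aside so that it is never swallowed by a blank
        body = line.rstrip('\n')
        ending = line[len(body):]
        if re.match(exc, body) is None:
            lineArr = re.split(sep, body)
            for i in range(len(lineArr)):
                if lineArr[i] == '':
                    continue
                if random.random() <= inter:
                    if len(lineArr[i]) > 2 and re.match(res, lineArr[i][0:3]):
                        prefix = lineArr[i][0:3]
                    else:
                        prefix = ''
                    # A blank as long as the segment, padded with full-width spaces (U+3000)
                    lineArr[i] = prefix + '(' + '\u3000' * len(lineArr[i]) + ')'
            # Every separator is written back as a full-width comma
            body = ','.join(lineArr)
        f.write(body + ending)
# Both files are flushed and closed automatically when the with block ends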
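
The per-line logic of the script can also be pulled into a function, which makes it easier to test and to reproduce a particular set of blanks. The following is a minimal sketch under a few assumptions: the name `blank_line` and its `rng` parameter are hypothetical, the separators are taken to be full-width, and blanks are padded with full-width spaces (U+3000).

```python
import random
import re

SEP = '[,。;!?、]'  # assumed full-width separators
EXC = '[#]'
RES = '[0-9+-][.]'


def blank_line(line, ratio, rng=random):
    """Return line with each segment blanked out with probability ratio.

    rng is anything with a random() method; pass random.Random(seed)
    to get the same blanks on every run.
    """
    body = line.rstrip('\n')
    ending = line[len(body):]
    if re.match(EXC, body):
        return line  # excluded lines are returned unchanged
    segments = re.split(SEP, body)
    for i, seg in enumerate(segments):
        if seg and rng.random() <= ratio:
            prefix = seg[:3] if len(seg) > 2 and re.match(RES, seg[:3]) else ''
            segments[i] = prefix + '(' + '\u3000' * len(seg) + ')'
    return ','.join(segments) + ending


# Usage: a seeded generator gives repeatable output for the same input
rng = random.Random(0)
print(blank_line('1. 春眠不觉晓,处处闻啼鸟。\n', 0.5, rng), end='')
```

With `ratio=1.0` every non-empty segment is blanked, because `random()` always returns a value below 1.0; that case is convenient for checking the prefix and line-ending handling without depending on the random draw.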