
Data Mining: Bayesian and KNN Algorithms

2022-06-15 Source: 知库网



Data Mining: a Java implementation of a newsgroup18828 document classifier based on the Bayesian and KNN algorithms (Part 1). The complete project for this classifier is available for download, with detailed run instructions; it runs in Eclipse. Readers studying data mining are welcome to try it out and to contact me with any questions :)

The previous article covered the preprocessing of the newsgroup18828 document set and the Java implementation of the Bayesian algorithm; below we implement a newsgroup text classifier based on the KNN algorithm.

1 Description of the KNN algorithm

STEP ONE: represent each text as a vector, computed from the TF*IDF values of its feature words.

STEP TWO: when a new text arrives, build its vector from the same feature words.

STEP THREE: select the K training texts most similar to the new text. Similarity is measured by the cosine of the angle between the vectors:

sim(d1, d2) = sum_k(w1k * w2k) / (sqrt(sum_k w1k^2) * sqrt(sum_k w2k^2))

where w1k is the weight of feature word k in d1.

There is no well-established way to determine K; one usually starts from an initial value and adjusts it based on experimental results. This project uses K = 20.

STEP FOUR: among the K neighbours of the new text, compute a weight for each category, equal to the sum of the similarities between the test sample and the neighbours belonging to that category.

STEP FIVE: compare the category weights and assign the text to the category with the largest weight.
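The five steps above can be sketched end to end on made-up toy data (the category names, dense vectors, and tiny K below are hypothetical; the real classifier later in this article works on sparse word-weight maps instead of arrays):

```java
import java.util.*;

public class KnnSketch {
    // STEP THREE's similarity: cosine of the angle between two weight vectors
    static double cosine(double[] a, double[] b) {
        double mul = 0, absA = 0, absB = 0;
        for (int i = 0; i < a.length; i++) {
            mul += a[i] * b[i];
            absA += a[i] * a[i];
            absB += b[i] * b[i];
        }
        return mul / (Math.sqrt(absA) * Math.sqrt(absB));
    }

    // STEPS THREE to FIVE: rank training vectors by similarity, then vote by summed similarity
    static String classify(double[] test, double[][] train, String[] cates, int k) {
        final double[] sim = new double[train.length];
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < train.length; i++) {
            idx[i] = i;
            sim[i] = cosine(test, train[i]);
        }
        // sort training indices by descending similarity to the test vector
        Arrays.sort(idx, (x, y) -> Double.compare(sim[y], sim[x]));
        // each of the K neighbours adds its similarity to its category's weight
        Map<String, Double> cateSim = new TreeMap<>();
        for (int n = 0; n < Math.min(k, train.length); n++)
            cateSim.merge(cates[idx[n]], sim[idx[n]], Double::sum);
        // the category with the largest summed weight wins
        return Collections.max(cateSim.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] train = {{1, 0, 0}, {0.9, 0.1, 0}, {0, 1, 1}};
        String[] cates = {"sci.space", "sci.space", "rec.sport"};
        System.out.println(classify(new double[]{1, 0.1, 0}, train, cates, 2)); // prints sci.space
    }
}
```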

2 Computing TF-IDF and vectorizing the documents

Implementing KNN first requires a vector representation of each document: compute the TF*IDF value of every feature word. Each document's vector has one dimension per feature word, holding that word's TF*IDF value.

TF and IDF, the term frequency and inverse document frequency of a feature word, are computed as:

TF(w, d) = count of w in d / total number of words in d

IDF(w) = log10(N / df(w)), where N is the corpus size and df(w) is the number of documents containing w
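As a quick sanity check of the two formulas, the sketch below computes TF*IDF weights over a hypothetical three-document corpus (the words are invented; note the actual project fixes IDF to 1 inside computeTFMultiIDF, as its comments explain):

```java
import java.util.*;

public class TfIdfSketch {
    // weight(w) = TF(w, doc) * IDF(w), with TF = count/length and IDF = log10(N/df)
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        // document frequency: each document counts a word at most once
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> d : corpus)
            for (String w : new TreeSet<>(d))
                df.merge(w, 1, Integer::sum);
        // raw counts of the target document
        Map<String, Double> weight = new TreeMap<>();
        for (String w : doc) weight.merge(w, 1.0, Double::sum);
        // convert counts to TF*IDF weights
        for (Map.Entry<String, Double> e : weight.entrySet()) {
            double tf = e.getValue() / doc.size();
            double idf = Math.log10((double) corpus.size() / df.get(e.getKey()));
            e.setValue(tf * idf);
        }
        return weight;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("space", "orbit", "orbit"),
                Arrays.asList("space", "hockey"),
                Arrays.asList("hockey", "hockey", "goal"));
        // "orbit" occurs only in document 0, so it outweighs the more common "space"
        System.out.println(tfIdf(corpus.get(0), corpus));
    }
}
```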

The document vectorization class ComputeWordsVector.java:

```java
package com.pku.yangliu;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.SortedMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.Iterator;

/** Computes the feature vector of every document.
 * @author yangliu
 * @qq 772330184
 * @mail yang.liu@pku.edu.cn
 */
public class ComputeWordsVector {

    /** Compute the TF vector of every document; a plain nested loop is enough, no recursion needed.
     * @param strDir absolute path of the preprocessed newsgroup directory
     * @param trainSamplePercent fraction of each category used as training samples
     * @param indexOfSample index of the first test sample of this split
     * @param wordMap the feature dictionary
     * @throws IOException
     */
    public void computeTFMultiIDF(String strDir, double trainSamplePercent,
            int indexOfSample, Map<String, Double> iDFPerWordMap,
            Map<String, Double> wordMap) throws IOException {
        File fileDir = new File(strDir);
        String word;
        SortedMap<String, Double> TFPerDocMap = new TreeMap<String, Double>();
        // Two writers are used: one for the test samples, one for the training samples
        String trainFileDir = "F:/DataMiningSample/docVector/wordTFIDFMapTrainSample" + indexOfSample;
        String testFileDir = "F:/DataMiningSample/docVector/wordTFIDFMapTestSample" + indexOfSample;
        FileWriter tsTrainWriter = new FileWriter(new File(trainFileDir));
        FileWriter tsTestWriter = new FileWriter(new File(testFileDir));
        FileWriter tsWriter = tsTrainWriter;
        File[] sampleDir = fileDir.listFiles();
        for (int i = 0; i < sampleDir.length; i++) {
            String cateShortName = sampleDir[i].getName();
            System.out.println("compute: " + cateShortName);
            File[] sample = sampleDir[i].listFiles();
            // index range of the test samples of this split
            double testBeginIndex = indexOfSample * (sample.length * (1 - trainSamplePercent));
            double testEndIndex = (indexOfSample + 1) * (sample.length * (1 - trainSamplePercent));
            System.out.println("dirName_total length:" + sampleDir[i].getCanonicalPath() + "_" + sample.length);
            System.out.println(trainSamplePercent + " length:" + sample.length * trainSamplePercent
                    + " testBeginIndex:" + testBeginIndex + " testEndIndex:" + testEndIndex);
            for (int j = 0; j < sample.length; j++) {
                TFPerDocMap.clear();
                FileReader samReader = new FileReader(sample[j]);
                BufferedReader samBR = new BufferedReader(samReader);
                String fileShortName = sample[j].getName();
                Double wordSumPerDoc = 0.0; // total word count of this document
                while ((word = samBR.readLine()) != null) {
                    // only words present in the feature dictionary count; discarded words are ignored
                    if (!word.isEmpty() && wordMap.containsKey(word)) {
                        wordSumPerDoc++;
                        if (TFPerDocMap.containsKey(word)) {
                            Double count = TFPerDocMap.get(word);
                            TFPerDocMap.put(word, count + 1);
                        } else {
                            TFPerDocMap.put(word, 1.0);
                        }
                    }
                }
                samBR.close();
                // Turn the counts into frequencies by dividing by the document length, multiply by
                // each word's IDF to get the final feature weight, and write the weights out.
                // Note that test samples and training samples go to different files.
                if (j >= testBeginIndex && j <= testEndIndex) {
                    tsWriter = tsTestWriter;
                } else {
                    tsWriter = tsTrainWriter;
                }
                Double wordWeight;
                Set<Map.Entry<String, Double>> tempTF = TFPerDocMap.entrySet();
                for (Iterator<Map.Entry<String, Double>> mt = tempTF.iterator(); mt.hasNext();) {
                    Map.Entry<String, Double> me = mt.next();
                    // wordWeight = (me.getValue() / wordSumPerDoc) * iDFPerWordMap.get(me.getKey());
                    // IDF is fixed to 1 for now; the improved IDF computation is covered in my post on k-means clustering
                    wordWeight = (me.getValue() / wordSumPerDoc) * 1.0;
                    TFPerDocMap.put(me.getKey(), wordWeight);
                }
                tsWriter.append(cateShortName + " ");
                String keyWord = fileShortName.substring(0, 5);
                tsWriter.append(keyWord + " ");
                Set<Map.Entry<String, Double>> tempTF2 = TFPerDocMap.entrySet();
                for (Iterator<Map.Entry<String, Double>> mt = tempTF2.iterator(); mt.hasNext();) {
                    Map.Entry<String, Double> ne = mt.next();
                    tsWriter.append(ne.getKey() + " " + ne.getValue() + " ");
                }
                tsWriter.append("\n");
                tsWriter.flush();
            }
        }
        tsTrainWriter.close();
        tsTestWriter.close();
        tsWriter.close();
    }

    /** Count the total occurrences of every word; words occurring more than 3 times form the final feature dictionary.
     * @param strDir absolute path of the preprocessed newsgroup directory
     * @throws IOException
     */
    public SortedMap<String, Double> countWords(String strDir,
            Map<String, Double> wordMap) throws IOException {
        File sampleFile = new File(strDir);
        File[] sample = sampleFile.listFiles();
        String word;
        for (int i = 0; i < sample.length; i++) {
            if (!sample[i].isDirectory()) {
                if (sample[i].getName().contains("stemed")) {
                    FileReader samReader = new FileReader(sample[i]);
                    BufferedReader samBR = new BufferedReader(samReader);
                    while ((word = samBR.readLine()) != null) {
                        if (!word.isEmpty() && wordMap.containsKey(word)) {
                            double count = wordMap.get(word) + 1;
                            wordMap.put(word, count);
                        } else {
                            wordMap.put(word, 1.0);
                        }
                    }
                    samBR.close();
                }
            } else {
                countWords(sample[i].getCanonicalPath(), wordMap);
            }
        }
        // keep only words that occur more than 3 times
        SortedMap<String, Double> newWordMap = new TreeMap<String, Double>();
        Set<Map.Entry<String, Double>> allWords = wordMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = allWords.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            if (me.getValue() > 3) {
                newWordMap.put(me.getKey(), me.getValue());
            }
        }
        return newWordMap;
    }

    /** Print the feature dictionary.
     * @param wordMap the feature dictionary
     * @throws IOException
     */
    void printWordMap(Map<String, Double> wordMap) throws IOException {
        System.out.println("printWordMap");
        int countLine = 0;
        File outPutFile = new File("F:/DataMiningSample/docVector/allDicWordCountMap.txt");
        FileWriter outPutFileWriter = new FileWriter(outPutFile);
        Set<Map.Entry<String, Double>> allWords = wordMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = allWords.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            outPutFileWriter.write(me.getKey() + " " + me.getValue() + "\n");
            countLine++;
        }
        outPutFileWriter.close();
        System.out.println("WordMap size " + countLine);
    }

    /** Compute IDF, i.e. in how many documents each dictionary word appears.
     * @param wordMap the feature dictionary
     * @return the IDF map of the words
     * @throws IOException
     */
    SortedMap<String, Double> computeIDF(String string,
            Map<String, Double> wordMap) throws IOException {
        File fileDir = new File(string);
        String word;
        SortedMap<String, Double> IDFPerWordMap = new TreeMap<String, Double>();
        Set<Map.Entry<String, Double>> wordMapSet = wordMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> pt = wordMapSet.iterator(); pt.hasNext();) {
            Map.Entry<String, Double> pe = pt.next();
            Double coutDoc = 0.0;
            String dicWord = pe.getKey();
            File[] sampleDir = fileDir.listFiles();
            for (int i = 0; i < sampleDir.length; i++) {
                File[] sample = sampleDir[i].listFiles();
                for (int j = 0; j < sample.length; j++) {
                    FileReader samReader = new FileReader(sample[j]);
                    BufferedReader samBR = new BufferedReader(samReader);
                    boolean isExited = false;
                    while ((word = samBR.readLine()) != null) {
                        if (!word.isEmpty() && word.equals(dicWord)) {
                            isExited = true;
                            break;
                        }
                    }
                    samBR.close();
                    if (isExited) coutDoc++;
                }
            }
            // IDF = log10(20000 / document frequency)
            Double IDF = Math.log(20000 / coutDoc) / Math.log(10);
            IDFPerWordMap.put(dicWord, IDF);
        }
        return IDFPerWordMap;
    }
}
```
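Each line that computeTFMultiIDF writes has the form "category filename word weight word weight ...". The sketch below, with a made-up sample line, shows how such a line is parsed back into a "category_filename" key and a word-weight map, which is what the KNN classifier's doProcess method does when it loads the sample files:

```java
import java.util.*;

public class VectorLineSketch {
    // parse "category filename w1 v1 w2 v2 ..." into a (key, vector) pair
    static Map.Entry<String, TreeMap<String, Double>> parse(String line) {
        String[] block = line.split(" ");
        TreeMap<String, Double> vec = new TreeMap<>();
        for (int i = 2; i < block.length; i += 2)
            vec.put(block[i], Double.valueOf(block[i + 1]));
        // key is "category_filename", so same-named files from different categories stay distinct
        return new AbstractMap.SimpleEntry<>(block[0] + "_" + block[1], vec);
    }

    public static void main(String[] args) {
        Map.Entry<String, TreeMap<String, Double>> e =
                parse("sci.space 49960 orbit 0.12 shuttle 0.07");
        System.out.println(e.getKey() + " -> " + e.getValue());
        // prints sci.space_49960 -> {orbit=0.12, shuttle=0.07}
    }
}
```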

3 Implementation of the KNN algorithm

Points to note in the implementation:

(1) keep the test set and the training set in a TreeMap<String, TreeMap<String, Double>>;

(2) use "category_filename" as each file's key, so that same-named files with different content do not clash;

(3) set the JVM heap parameters, otherwise a Java heap overflow error occurs;
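The implementation ranks training samples by copying a HashMap of similarities into a TreeMap whose comparator orders keys by their similarity value (the ByValueComparator in the listing below). A minimal sketch of that trick, with invented file keys and scores, plus its main caveat:

```java
import java.util.*;

public class SortByValueSketch {
    public static void main(String[] args) {
        HashMap<String, Double> sim = new HashMap<>();
        sim.put("sci.space_101", 0.83);
        sim.put("rec.sport_7", 0.12);
        sim.put("sci.space_102", 0.64);
        // a TreeMap whose comparator looks keys up in sim orders entries by value, descending
        TreeMap<String, Double> sorted =
                new TreeMap<>((a, b) -> Double.compare(sim.get(b), sim.get(a)));
        sorted.putAll(sim);
        System.out.println(sorted.firstKey()); // prints sci.space_101
        // caveat: keys with exactly equal values compare as 0 and collapse into one entry,
        // so production code should break ties, e.g. by also comparing the keys themselves
    }
}
```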

(4) similarity is computed as the cosine of the angle between vectors.

The KNN implementation class KNNClassifier.java:

```java
package com.pku.yangliu;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** KNN classifier; similarity is the cosine of the angle between vectors.
 * @author yangliu
 * @qq 772330184
 * @mail yang.liu@pku.edu.cn
 */
public class KNNClassifier {

    /** Classify the test documents with KNN, reading the test and training sample sets.
     * @param trainFiles file holding the vectors of all training samples
     * @param testFiles file holding the vectors of all test samples
     * @param kNNResultFile path of the KNN classification result file
     * @return double classification accuracy
     * @throws IOException
     */
    private double doProcess(String trainFiles, String testFiles,
            String kNNResultFile) throws IOException {
        // Read the training and test samples into Map<String, TreeMap<String, Double>>,
        // keeping the training samples' category labels as well.
        // Then, for each test sample, compute its similarity to every training sample,
        // store the similarities in a value-sorted map, take the top K samples, sum the
        // similarities per category, and pick the category with the highest score as the
        // test sample's category. K can be tuned by repeated experiments.
        // !Use "category_filename" as each file's key, so that same-named files with
        //  different content do not clash.
        // !Set the JVM heap parameters, otherwise a Java heap overflow error occurs.
        // !Similarity is the cosine of the angle between vectors.
        File trainSamples = new File(trainFiles);
        BufferedReader trainSamplesBR = new BufferedReader(new FileReader(trainSamples));
        String line;
        String[] lineSplitBlock;
        Map<String, Map<String, Double>> trainFileNameWordTFMap = new TreeMap<String, Map<String, Double>>();
        TreeMap<String, Double> trainWordTFMap = new TreeMap<String, Double>();
        while ((line = trainSamplesBR.readLine()) != null) {
            lineSplitBlock = line.split(" ");
            trainWordTFMap.clear();
            for (int i = 2; i < lineSplitBlock.length; i = i + 2) {
                trainWordTFMap.put(lineSplitBlock[i], Double.valueOf(lineSplitBlock[i + 1]));
            }
            TreeMap<String, Double> tempMap = new TreeMap<String, Double>();
            tempMap.putAll(trainWordTFMap);
            trainFileNameWordTFMap.put(lineSplitBlock[0] + "_" + lineSplitBlock[1], tempMap);
        }
        trainSamplesBR.close();

        File testSamples = new File(testFiles);
        BufferedReader testSamplesBR = new BufferedReader(new FileReader(testSamples));
        Map<String, Map<String, Double>> testFileNameWordTFMap = new TreeMap<String, Map<String, Double>>();
        Map<String, String> testClassifyCateMap = new TreeMap<String, String>(); // <filename, category> pairs produced by classification
        Map<String, Double> testWordTFMap = new TreeMap<String, Double>();
        while ((line = testSamplesBR.readLine()) != null) {
            lineSplitBlock = line.split(" ");
            testWordTFMap.clear();
            for (int i = 2; i < lineSplitBlock.length; i = i + 2) {
                testWordTFMap.put(lineSplitBlock[i], Double.valueOf(lineSplitBlock[i + 1]));
            }
            TreeMap<String, Double> tempMap = new TreeMap<String, Double>();
            tempMap.putAll(testWordTFMap);
            testFileNameWordTFMap.put(lineSplitBlock[0] + "_" + lineSplitBlock[1], tempMap);
        }
        testSamplesBR.close();
        // classify each test sample by computing its distance to all training samples
        String classifyResult;
        FileWriter testYangliuWriter = new FileWriter(new File("F:/DataMiningSample/docVector/yangliuTest"));
        FileWriter KNNClassifyResWriter = new FileWriter(kNNResultFile);
        Set<Map.Entry<String, Map<String, Double>>> testFileNameWordTFMapSet = testFileNameWordTFMap.entrySet();
        for (Iterator<Map.Entry<String, Map<String, Double>>> it = testFileNameWordTFMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Map<String, Double>> me = it.next();
            classifyResult = KNNComputeCate(me.getKey(), me.getValue(), trainFileNameWordTFMap, testYangliuWriter);
            KNNClassifyResWriter.append(me.getKey() + " " + classifyResult + "\n");
            KNNClassifyResWriter.flush();
            testClassifyCateMap.put(me.getKey(), classifyResult);
        }
        KNNClassifyResWriter.close();
        // compute the classification accuracy
        double righteCount = 0;
        Set<Map.Entry<String, String>> testClassifyCateMapSet = testClassifyCateMap.entrySet();
        for (Iterator<Map.Entry<String, String>> it = testClassifyCateMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, String> me = it.next();
            String rightCate = me.getKey().split("_")[0];
            if (me.getValue().equals(rightCate)) {
                righteCount++;
            }
        }
        testYangliuWriter.close();
        return righteCount / testClassifyCateMap.size();
    }

    /** Compute a test sample's cosine similarity to every training sample, keep the
     * similarities in a value-sorted map, take the top K samples, sum the similarities
     * per category, and return the category with the highest score. K can be tuned
     * repeatedly to find the value with the best accuracy.
     * @param testWordTFMap the <word, TF> vector of the current test file
     * @param trainFileNameWordTFMap the <category_filename, vector> map of the training samples
     * @param testYangliuWriter
     * @return String the category with the highest K-neighbour score
     * @throws IOException
     */
    private String KNNComputeCate(
            String testFileName,
            Map<String, Double> testWordTFMap,
            Map<String, Map<String, Double>> trainFileNameWordTFMap,
            FileWriter testYangliuWriter) throws IOException {
        // <category_filename, similarity>; this HashMap is sorted by value below
        HashMap<String, Double> simMap = new HashMap<String, Double>();
        double similarity;
        Set<Map.Entry<String, Map<String, Double>>> trainFileNameWordTFMapSet = trainFileNameWordTFMap.entrySet();
        for (Iterator<Map.Entry<String, Map<String, Double>>> it = trainFileNameWordTFMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Map<String, Double>> me = it.next();
            similarity = computeSim(testWordTFMap, me.getValue());
            simMap.put(me.getKey(), similarity);
        }
        // sort simMap by value
        ByValueComparator bvc = new ByValueComparator(simMap);
        TreeMap<String, Double> sortedSimMap = new TreeMap<String, Double>(bvc);
        sortedSimMap.putAll(simMap);

        // take the K nearest training samples and sum the similarities per category;
        // the value of K is found by repeated experiments
        Map<String, Double> cateSimMap = new TreeMap<String, Double>();
        double K = 20;
        double count = 0;
        double tempSim;

        Set<Map.Entry<String, Double>> simMapSet = sortedSimMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = simMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            count++;
            String categoryName = me.getKey().split("_")[0];
            if (cateSimMap.containsKey(categoryName)) {
                tempSim = cateSimMap.get(categoryName);
                cateSimMap.put(categoryName, tempSim + me.getValue());
            } else {
                cateSimMap.put(categoryName, me.getValue());
            }
            if (count > K) break;
        }
        // find the category with the largest similarity sum in cateSimMap
        double maxSim = 0;
        String bestCate = null;
        Set<Map.Entry<String, Double>> cateSimMapSet = cateSimMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = cateSimMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            if (me.getValue() > maxSim) {
                bestCate = me.getKey();
                maxSim = me.getValue();
            }
        }
        return bestCate;
    }

    /** Compute the similarity between a test sample vector and a training sample vector,
     * as the cosine of the angle between them.
     * @param testWordTFMap the <word, TF> vector of the current test file
     * @param trainWordTFMap the <word, TF> vector of the current training sample
     * @return Double the similarity between the vectors
     */
    private double computeSim(Map<String, Double> testWordTFMap,
            Map<String, Double> trainWordTFMap) {
        double mul = 0, testAbs = 0, trainAbs = 0;
        Set<Map.Entry<String, Double>> testWordTFMapSet = testWordTFMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = testWordTFMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            if (trainWordTFMap.containsKey(me.getKey())) {
                mul += me.getValue() * trainWordTFMap.get(me.getKey());
            }
            testAbs += me.getValue() * me.getValue();
        }
        testAbs = Math.sqrt(testAbs);

        Set<Map.Entry<String, Double>> trainWordTFMapSet = trainWordTFMap.entrySet();
        for (Iterator<Map.Entry<String, Double>> it = trainWordTFMapSet.iterator(); it.hasNext();) {
            Map.Entry<String, Double> me = it.next();
            trainAbs += me.getValue() * me.getValue();
        }
        trainAbs = Math.sqrt(trainAbs);
        return mul / (testAbs * trainAbs);
    }

    /** Generate the file of correct categories from the KNN result file; the accuracy
     * and confusion matrix computations reuse the methods of the Bayesian classifier class.
     * @param kNNRightFile file of correct categories
     * @param kNNResultFile classification result file
     * @throws IOException
     */
    private void createRightFile(String kNNResultFile, String kNNRightFile) throws IOException {
        String rightCate;
        FileReader fileR = new FileReader(kNNResultFile);
        FileWriter KNNRrightResult = new FileWriter(new File(kNNRightFile));
        BufferedReader fileBR = new BufferedReader(fileR);
        String line;
        String[] lineBlock;
        while ((line = fileBR.readLine()) != null) {
            lineBlock = line.split(" ");
            rightCate = lineBlock[0].split("_")[0];
            KNNRrightResult.append(lineBlock[0] + " " + rightCate + "\n");
        }
        fileBR.close();
        KNNRrightResult.flush();
        KNNRrightResult.close();
    }

    /**
     * @param args
     * @throws IOException
     */
    public void KNNClassifierMain(String[] args) throws IOException {
        // wordMap is the feature dictionary: <word, occurrences in all documents>
        double[] accuracyOfEveryExp = new double[10];
        double accuracyAvg, sum = 0;
        KNNClassifier knnClassifier = new KNNClassifier();
        NaiveBayesianClassifier nbClassifier = new NaiveBayesianClassifier();
        Map<String, Double> wordMap = new TreeMap<String, Double>();
        Map<String, Double> IDFPerWordMap = new TreeMap<String, Double>();
        ComputeWordsVector computeWV = new ComputeWordsVector();
        wordMap = computeWV.countWords("F:/DataMiningSample/processedSampleOnlySpecial", wordMap);
        IDFPerWordMap = computeWV.computeIDF("F:/DataMiningSample/processedSampleOnlySpecial", wordMap);
        computeWV.printWordMap(wordMap);
        // generate the document TF matrix files needed for the 10 KNN experiments
        for (int i = 0; i < 10; i++) {
            computeWV.computeTFMultiIDF("F:/DataMiningSample/processedSampleOnlySpecial", 0.9, i, IDFPerWordMap, wordMap);
            String trainFiles = "F:/DataMiningSample/docVector/wordTFIDFMapTrainSample" + i;
            String testFiles = "F:/DataMiningSample/docVector/wordTFIDFMapTestSample" + i;
            String kNNResultFile = "F:/DataMiningSample/docVector/KNNClassifyResult" + i;
            String kNNRightFile = "F:/DataMiningSample/docVector/KNNClassifyRight" + i;
            accuracyOfEveryExp[i] = knnClassifier.doProcess(trainFiles, testFiles, kNNResultFile);
            knnClassifier.createRightFile(kNNResultFile, kNNRightFile);
            // recompute the accuracy, reusing the Bayesian classifier's method
            accuracyOfEveryExp[i] = nbClassifier.computeAccuracy(kNNResultFile, kNNRightFile);
            sum += accuracyOfEveryExp[i];
            System.out.println("The accuracy for KNN Classifier in " + i + "th Exp is :" + accuracyOfEveryExp[i]);
        }
        accuracyAvg = sum / 10;
        System.out.println("The average accuracy for KNN Classifier in all Exps is :" + accuracyAvg);
    }

    // sorts the HashMap keys by descending value
    static class ByValueComparator implements Comparator<String> {
        HashMap<String, Double> base_map;

        public ByValueComparator(HashMap<String, Double> disMap) {
            this.base_map = disMap;
        }

        @Override
        public int compare(String arg0, String arg1) {
            if (!base_map.containsKey(arg0) || !base_map.containsKey(arg1)) {
                return 0;
            }
            // compare the Double values, not the boxed references
            return base_map.get(arg1).compareTo(base_map.get(arg0));
        }
    }
}
```

The classifier's main class:

```java
package com.pku.yangliu;

/** Main class of the classifier: runs data preprocessing, Naive Bayes classification,
 * and KNN classification in turn.
 * @author yangliu
 * @qq 772330184
 * @mail yang.liu@pku.edu.cn
 */
public class ClassifierMain {

    public static void main(String[] args) throws Exception {
        DataPreProcess DataPP = new DataPreProcess();
        NaiveBayesianClassifier nbClassifier = new NaiveBayesianClassifier();
        KNNClassifier knnClassifier = new KNNClassifier();
        //DataPP.BPPMain(args);
        nbClassifier.NaiveBayesianClassifierMain(args);
        knnClassifier.KNNClassifierMain(args);
    }
}
```

5 Classification results of the KNN algorithm

Expressed as a confusion matrix (figure not reproduced here), the 6th experiment reaches an accuracy of 82.10%.

Runtime hardware environment: Intel Core 2 Duo CPU T5750 @ 2 GHz, 2 GB RAM; the comparison with the Bayesian algorithm was run on the same hardware.


As shown above, taking the 30,095 words that occur at least 4 times as feature words: the average accuracy over 10 cross-validation experiments is 78.19%, with a total running time of 1 h 55 min; the per-experiment accuracy ranges from 73.62% to 82.10%, and 3 of the 10 experiments exceed 80%.

6 Accuracy comparison between Naive Bayes and KNN

Taking the same 30,095 feature words (those occurring at least 4 times) and running 10 cross-validation experiments, the Naive Bayes and KNN results on the Newsgroup documents compare as follows (chart not reproduced here):


Conclusions

On classification accuracy, the KNN algorithm performs better; on classification speed, the Naive Bayes algorithm performs better.
