Machine Learning with Spark from Nick Pentreath.

## Running Spark
We use the Spark version provided with HDP 2.2.4.
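If you want to double-check which Spark build you actually ended up with, a quick way is to ask the SparkContext itself from a running Spark shell (see the shell notes at the end of this post). A minimal sketch:

```scala
// Print the Spark version the shell is running against
// (sc is the SparkContext the shell creates for you).
println(sc.version)
```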
We already have some datasets in `hdfs://shared`; we'll use this one to save some time (and some space) while playing with MLlib on Spark.

The data is available at `hdfs://daplab-rt-12:8020/shared/20_newsgroups/`.
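Before loading anything into Spark, it can be worth a quick check that the dataset is really where we expect it. A minimal sketch, run from the Spark shell; the path is the one used in the code below, so adjust it to your cluster:

```scala
// Sanity check: list the newsgroup directories under the shared dataset path.
import org.apache.hadoop.fs.Path

val dataDir = new Path("hdfs://daplab-rt-12:8020/shared/20_newsgroups")
val fs = dataDir.getFileSystem(sc.hadoopConfiguration)
fs.listStatus(dataDir).foreach(status => println(status.getPath))
```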
```scala
val path = "hdfs://daplab-rt-12:8020/shared/20_newsgroups/*"
val rdd = sc.wholeTextFiles(path)
// The newsgroup a message belongs to is the name of its parent directory
val newsgroups = rdd.map { case (file, text) => file.split("/").takeRight(2).head }
val countByGroups = newsgroups.map(n => (n, 1)).reduceByKey(_ + _).collect.sortBy(-_._2).mkString("\n")

// Tokenizing the text data
val text = rdd.map { case (file, text) => text }
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
val nonWordSplit = text.flatMap(t => t.split("""\W+""").map(_.toLowerCase))
// Keep only tokens that contain no digits
val regex = """[^0-9]*""".r
val filterNumbers = nonWordSplit.filter(token => regex.pattern.matcher(token).matches)
val tokenCounts = filterNumbers.map(t => (t, 1)).reduceByKey(_ + _)
// Remove stopwords and tokens shorter than two characters
val stopwords = Set("the", "a", "an", "of", "or", "in", "for", "by", "on", "but", "is", "not", "with", "as", "was")
val tokenCountsFilteredStopwords = tokenCounts.filter { case (k, v) => !stopwords.contains(k) }
val tokenCountsFilteredSize = tokenCountsFilteredStopwords.filter { case (k, v) => k.size >= 2 }
// Filter out rare tokens with total occurrence < 2
val rareTokens = tokenCounts.filter { case (k, v) => v < 2 }.map { case (k, v) => k }.collect.toSet
val tokenCountsFilteredAll = tokenCountsFilteredSize.filter { case (k, v) => !rareTokens.contains(k) }
```

```scala
// Create a function to tokenize each document
def tokenize(line: String): Seq[String] = {
  line.split("""\W+""")
    .map(_.toLowerCase)
    .filter(token => regex.pattern.matcher(token).matches)
    .filterNot(token => stopwords.contains(token))
    .filterNot(token => rareTokens.contains(token))
    .filter(token => token.size >= 2)
    .toSeq
}
val tokens = text.map(doc => tokenize(doc))
```

```scala
// Train a Word2Vec model on the tokenized documents
import org.apache.spark.mllib.feature.Word2Vec
val word2vec = new Word2Vec()
word2vec.setSeed(42) // we do this to generate the same results each time
val word2vecModel = word2vec.fit(tokens)
word2vecModel.findSynonyms("research", 10).foreach(println)
```

```scala
word2vecModel.findSynonyms("france", 10).foreach(println)
```

This code is intended to be run in the Scala shell. Launch the Scala Spark shell by running `./bin/spark-shell` from the Spark directory. You can enter each line in the shell and see the result immediately. The expected output in the Spark console is presented as commented lines following the relevant code.
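Coming back to the model itself: besides `findSynonyms`, the fitted model can also hand back the learned vector for a single word. A minimal sketch; `research` is just an example token here, and the word has to be present in the model's vocabulary or the lookup will fail:

```scala
// Look up the learned embedding for one word.
// Throws if the word is not in the model's vocabulary.
val vector = word2vecModel.transform("research")
println(vector.size) // dimensionality of the embedding (100 by default)
```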
The Scala shell creates a SparkContext variable, available to us as `sc`.
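For example, a trivial job is enough to confirm the context is alive (just an illustration, not part of the walkthrough):

```scala
// Run a trivial job through the pre-built SparkContext
val n = sc.parallelize(1 to 100).count() // n: Long = 100
```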
Ensure you start your Spark shell with enough memory: `./bin/spark-shell --driver-memory 4g`
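If you are unsure whether the extra memory was actually picked up, the setting can be read back from inside the shell; a small sketch (depending on how the shell was launched, the property may or may not show up in the conf):

```scala
// Read back the driver memory setting, if it was recorded in the Spark conf.
// getOption returns None when the property is not present.
println(sc.getConf.getOption("spark.driver.memory"))
```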