Scikit-Learn Tutorial: Python and Machine Learning


An easy-to-follow scikit-learn tutorial for beginners who want to implement machine learning with Python.

The DataCamp team is thrilled to announce that our Python Machine Learning: Scikit-Learn Tutorial has been generously translated to Traditional Chinese by our friend and DataCamp user Tony Yao-Jen Kuo!

Machine Learning with Python

Machine learning is a branch of computer science concerned with designing algorithms that can learn: by observing known data, a machine learns to make predictions about unknown data.

Typical applications include concept learning, function learning, predictive modeling, clustering, and finding predictive patterns.

The ultimate goal is for computers to improve their own learning, so that prediction accuracy on unknown data increases as known data accumulates, sparing users the effort of manual tuning and correction.

Machine learning is closely related to knowledge discovery, data mining, artificial intelligence (AI), and statistics. Its applications range from academic research to business, from robot scientists to spam filtering and recommender systems.

Machine learning is an indispensable skill for any good data scientist. This tutorial introduces machine learning with Python from scratch and demonstrates several unsupervised and supervised machine learning algorithms.

If you are more interested in implementing machine learning with the R language, see the Machine Learning with R for Beginners tutorial.

Loading the Data

As with any data science project, the first step of this tutorial is loading the data into our Python environment.

If you are new to machine learning, we recommend three great data sources: the UC Irvine Machine Learning Repository, Kaggle, and the dataset collections curated by KDnuggets.

To warm up, let's use Python's machine learning package scikit-learn to load a dataset called digits.

Fun fact: scikit-learn grew out of SciPy. There are in fact many scikits; the scikit-learn package we use is dedicated to machine learning and data mining, which is why it carries "learn" in its name. :)

We first import the datasets module from the sklearn package, then call the load_digits() method of the datasets module to load the data:

```python
# Import `datasets` from `sklearn`
from sklearn import datasets

# Load in the `digits` data
digits = datasets.load_digits()

# Print the `digits` data
print(digits)
```

The datasets module offers other data-loading methods as well, and it can also generate synthetic data.

The digits dataset we are using can also be loaded from the UC Irvine Machine Learning Repository; you can find it at this link.

If you wanted to load digits from the UC Irvine Machine Learning Repository instead, the loading code would look like this:

```python
# Import the `pandas` library as `pd`
import pandas as pd

# Load in the data with `read_csv()`
digits = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra", header=None)

# Print out `digits`
print(digits)
```
Note that, as the .tra and .tes file extensions suggest, the UC Irvine Machine Learning Repository has already split this dataset into training and test sets; the code above loads only the training data, so to do machine learning you would still need to load the test data as well.

Tip: to learn more about loading and wrangling data with Python's pandas package, see the Importing Data in Python course.

Exploring the Data

Reading a dataset's documentation or description carefully is a good habit. The UC Irvine Machine Learning Repository provides documentation for every dataset, and reading it deepens our understanding of the data.

However, a first impression is not enough; the next step is exploratory data analysis. Where should we start exploring these handwritten digit images?

Gathering Basic Information

If we load the digits data directly through scikit-learn, there is no web page with a description or documentation as there is for the UC Irvine Machine Learning Repository, so we gather basic information through the attributes and methods of digits instead.

We use the keys() method of digits to find out what information is available; the data attribute to inspect the predictor variables; the target attribute to inspect the target variable; and the DESCR attribute to read the dataset's description.

Let's gather this information:

```python
# Get the keys of the `digits` data
print(digits.keys())

# Print out the data
print(digits.data)

# Print out the target values
print(digits.target)

# Print out the description of the `digits` data
print(digits.DESCR)
```

Recall the digits output printed in the first exercise: it contained many numpy arrays. The most important property of an array to understand is its shape.

For example, if we create a 3-d array with y = np.zeros((2, 3, 4)), its shape is (2, 3, 4), a tuple of integers.
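A quick check of the shape idea (plain numpy, independent of the digits data):

```python
import numpy as np

# A 3-d array of zeros: 2 blocks, each with 3 rows and 4 columns
y = np.zeros((2, 3, 4))

# `shape` is a tuple of integers, one entry per dimension
print(y.shape)
print(y.ndim)   # number of dimensions
print(y.size)   # total number of elements: 2 * 3 * 4
```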

Continuing from the previous exercise, let's look at the shapes of data, target, DESCR, and images. Use the data attribute of digits to assign that array to digits_data and inspect its shape attribute, then do the same for the other three attributes:

```python
# Isolate the `digits` data
digits_data = digits.data

# Inspect the shape
print(digits_data.shape)

# Isolate the target values with `target`
digits_target = digits.target

# Inspect the shape
print(digits_target.shape)

# Print the number of unique labels
number_digits = len(np.unique(digits.target))

# Isolate the `images`
digits_images = digits.images

# Inspect the shape
print(digits_images.shape)
```

To summarize: digits.data.shape tells us the data has 1,797 observations with 64 variables; digits.target.shape tells us there are 1,797 target values (also called labels); and len(np.unique(digits.target)) tells us there are only 10 unique target values, 0 through 9. In other words, our model's job is to recognize which of the digits 0 through 9 a handwritten image shows.

Finally, digits.images has three dimensions: 1,797 matrices of 8x8 pixels. We can reshape digits.images into two dimensions and use numpy's all() method to check that its elements are identical to digits.data:

```python
print(np.all(digits.images.reshape((1797, 64)) == digits.data))
```

which prints True.

Visualizing Handwritten Digit Images with matplotlib

Next we use Python's data visualization package matplotlib to visualize these handwritten digit images:

```python
# Import `datasets` from `sklearn`
from sklearn import datasets

# Import matplotlib
import matplotlib.pyplot as plt

# Load `digits`
digits = datasets.load_digits()

# Set the figure size (width, height)
fig = plt.figure(figsize=(4, 2))

# Adjust the subplots
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# Draw the first 8 handwritten digits in subplots
for i in range(8):
    # Initialize subplot i+1 in a 2x4 grid, with the axis ticks turned off
    ax = fig.add_subplot(2, 4, i + 1, xticks=[], yticks=[])
    # Display the image in grayscale
    ax.imshow(digits.images[i], cmap=plt.cm.binary)
    # Label the target value in the lower-left corner
    ax.text(0, 7, str(digits.target[i]))

# Show the plot
plt.show()
```

This code may look a little daunting, so let's break it down:

Import the matplotlib package.

Set up a blank figure 4 inches wide and 2 inches high, on which the subplots will be drawn.

Adjust some subplot parameters.

Use a for loop to start filling the blank figure.

Initialize 8 subplots, one per cell of the 2x4 grid.

The finishing touch is displaying the target value at position (0, 7) of each subplot, the lower-left corner.

Don't forget plt.show() to display the finished figure! When it runs, we see this visualization.

Alternatively, here is a more concise version:

```python
# Import `datasets` from `sklearn`
from sklearn import datasets

# Import matplotlib
import matplotlib.pyplot as plt

# Load `digits`
digits = datasets.load_digits()

# Zip the images and target values into a list
images_and_labels = list(zip(digits.images, digits.target))

# For each element in the list
for i, (image, label) in enumerate(images_and_labels[:8]):
    # Initialize a subplot at position i+1
    plt.subplot(2, 4, i + 1)
    # Turn off the subplot's axis ticks
    plt.axis('off')
    # Display the image in grayscale
    plt.imshow(image, cmap=plt.cm.binary)
    # Add a title to the subplot
    plt.title('Training: ' + str(label))

# Show the plot
plt.show()
```

When it runs, we see this visualization. In this example we store the two arrays in the variable images_and_labels, then draw its first 8 elements (pairs of digits.images and the corresponding digits.target) as subplots on a 2x4 grid, rendered with the plt.cm.binary grayscale colormap and titled with the label.

After these two visualization exercises you should have a much better feel for the digits data we are working with!

Visualization: Principal Component Analysis (PCA)

The digits data has 64 variables. Faced with such high-dimensional data (in practice, financial and climate data are often high-dimensional too), we need methods that identify the two or three most important variables, or that combine many variables into a few dimensions that are easier to understand and visualize.

This is called dimensionality reduction. We will use one such method, Principal Component Analysis (PCA), to help us visualize the digits data.

The idea of PCA is to find linear combinations of the variables that form new principal components and then use those components in place of the original variables; it is a linear transformation that maximizes the variance of the data. To learn more, see this link.
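As a side note not covered by the original exercises, a fitted PCA model reports how much variance each component captures through its explained_variance_ratio_ attribute; a minimal sketch on the digits data:

```python
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()

# Reduce the 64-dimensional digits data to 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(digits.data)

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(reduced.shape)
```

Two components capture only part of the total variance, which is worth keeping in mind when interpreting the 2-d scatter plots below.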

scikit-learn makes it easy to run PCA on the digits data:

```python
# Create a Randomized PCA model that takes two components
randomized_pca = RandomizedPCA(n_components=2)

# Fit and transform the data to the model
reduced_data_rpca = randomized_pca.fit_transform(digits.data)

# Create a regular PCA model
pca = PCA(n_components=2)

# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(digits.data)

# Inspect the shape
print("Shape of reduced_data_pca:", reduced_data_pca.shape)
print("---")

# Print out the data
print("RPCA:")
print(reduced_data_rpca)
print("---")
print("PCA:")
print(reduced_data_pca)
```

(Note: RandomizedPCA has since been removed from scikit-learn; in current versions use PCA(n_components=2, svd_solver='randomized') instead.)

Tip: compare the results produced by RandomizedPCA() and PCA() and observe how they differ.

We reduced the data to two principal components so that we can visualize it with a scatter plot and see whether two components are enough to separate the different target values:

```python
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

digits = datasets.load_digits()

# Randomized PCA from the earlier exercise
# (RandomizedPCA in the original; current scikit-learn uses the svd_solver parameter)
reduced_data_rpca = PCA(n_components=2, svd_solver='randomized').fit_transform(digits.data)

colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
for i in range(len(colors)):
    x = reduced_data_rpca[:, 0][digits.target == i]
    y = reduced_data_rpca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])

plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()
```

When it runs, we see this visualization. Once again we used matplotlib for plotting; if you are building your own data science portfolio, you might consider higher-level, better-looking plotting packages.

Also, if you develop in Jupyter Notebook you don't need to call plt.show(); if you are curious about Jupyter Notebook, see the Definitive Guide to Jupyter Notebook.

Let's walk through the previous code block:

Store the different colors in a list.

Since there are 10 distinct target values (0 through 9), we specify 10 different colors to mark them.

Set up the x and y values.

Select the first and second columns of reduced_data_rpca, picking the observations that match each target value: on the first pass through the loop we select the observations whose target is 0, then those whose target is 1, and so on.

Draw the scatter plot.

On the first pass the observations with target 0 are drawn in black, then those with target 1 in blue, and so on.

Add a legend next to the scatter plot using the target_names key.

Add a title and axis labels.

Show the plot.

What's next?

Now that we have some understanding of the data, we must think about how to apply it, and which machine learning algorithm to use to build a predictive model.

Tip: the better you know your data, the easier it is to find an application and a suitable machine learning algorithm.

For a scikit-learn beginner, however, the package can feel overwhelming; the scikit-learn machine learning map can offer extra guidance.

We want to apply an unsupervised learning algorithm to the digits data. Following the machine learning map: more than 50 observations (check!), predicting a category (check!), no labeled targets (just don't use digits.target, check!), the number of categories is known (check!), and fewer than 10,000 observations (check!), so we arrive at K-Means!

But what exactly is the K-Means algorithm? K-Means is the simplest and most widely used unsupervised learning algorithm for clustering problems.

The algorithm first places k center points arbitrarily, computes each observation's distance to those k centers, and assigns each observation to its nearest center, labeling it accordingly to form k clusters.

The k center points are then recomputed and moved to the current center of each cluster, the distances from each observation to the k centers are recalculated, and the cluster labels are updated.

This process repeats until the observations' cluster labels stabilize and no longer change.
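The loop described above can be sketched in plain numpy. This is a toy illustration of the iterative assign-and-update idea, not scikit-learn's actual implementation (which adds k-means++ initialization, multiple restarts, and heavy optimization):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Start from k randomly chosen observations as the centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each observation to its nearest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # 3. Move each center to the mean of its current cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centers (and hence the labels) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs around (0, 0) and (10, 10)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```

On well-separated data like this, the loop converges in a handful of iterations.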

The value of k is specified by the user, and the k initial centers are placed at random; those random positions affect the result of K-Means, which can be addressed with the n_init parameter.

Preprocessing the Data

Before applying the K-Means algorithm, we should learn about data preprocessing.

Standardizing the Data

We use the scale() method of the sklearn.preprocessing module to standardize the digits data:

```python
# Import `scale`
from sklearn.preprocessing import scale

# Apply `scale()` to the `digits` data
data = scale(digits.data)
```

Standardization transforms each of the 64 dimensions into a distribution with mean 0 and standard deviation 1.
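You can verify what scale() did directly. (One caveat: a few border pixels in digits are zero in every image, and scale() leaves such constant columns at 0 instead of dividing by a zero standard deviation.)

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)

# Every column now has mean ~0; every non-constant column has standard deviation 1
col_means = data.mean(axis=0)
col_stds = data.std(axis=0)
print(np.abs(col_means).max())
print(col_stds.max())
```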

Splitting the Data into Training and Test Sets

To evaluate the model's performance later, we also need to split the data into training and test sets: the training set is used to build the model, the test set to evaluate it.

In practice the two sets never overlap; a common split is 2/3 training data and 1/3 test data.

In the next code block we set the test_size parameter of train_test_split() to 0.25, and set random_state to 42 to ensure every split produces the same result, a very useful technique when you want reproducible output.
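A quick sanity check of the reproducibility claim (train_test_split lives in sklearn.model_selection in current scikit-learn versions):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Two splits with the same random_state give identical partitions
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(digits.data, digits.target, test_size=0.25, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(digits.data, digits.target, test_size=0.25, random_state=42)

print((X_te1 == X_te2).all(), (y_te1 == y_te2).all())
print(len(y_tr1), len(y_te1))
```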

```python
# Import `train_test_split`
# (in old scikit-learn versions this lived in `sklearn.cross_validation`)
from sklearn.model_selection import train_test_split

# Split the `digits` data into training and test sets
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)
```

After splitting the training and test data, let's take a quick look at the observations and target values of the training data:

```python
# Number of training samples and features
n_samples, n_features = X_train.shape

# Print out `n_samples`
print(n_samples)

# Print out `n_features`
print(n_features)

# Number of training labels
n_digits = len(np.unique(y_train))

# Inspect `y_train`
print(len(y_train))
```

The training data X_train now has 1,347 observations and y_train has 1,347 target values, 75% of the original digits data; X_test has 450 observations and y_test has 450 target values, the remaining 25%.

Clustering the digits Data

After standardizing and splitting, we can start using the K-Means algorithm, building the model with the KMeans() method of the cluster module. Three parameters deserve attention here: init, n_clusters, and random_state.

You will surely remember random_state, the parameter that ensures we get the same result every time we run this code:

```python
# Import the `cluster` module
from sklearn import cluster

# Create the KMeans model
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)

# Fit the training data `X_train` to the model
clf.fit(X_train)
```

The init parameter selects the initialization method, k-means++; since k-means++ is already the default for init, this argument could actually be omitted.

The n_clusters parameter is set to 10, matching our 10 distinct target values, 0 through 9.

If the number of clusters were unknown, you would typically try several values of n_clusters, compute the sum of squared errors (SSE) for each, and pick the n_clusters with the smallest SSE; in other words, minimize each observation's distance to its cluster center.
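That search can be sketched with KMeans's inertia_ attribute, which stores exactly this sum of squared distances to the cluster centers (the full elbow method would plot these values and look for the bend):

```python
from sklearn import cluster, datasets
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)

# SSE for a few candidate cluster counts; more clusters always lowers the SSE,
# so in practice you look for the "elbow" where the improvement levels off
sse = {}
for k in (5, 10, 15):
    model = cluster.KMeans(init='k-means++', n_clusters=k, random_state=42)
    model.fit(data)
    sse[k] = model.inertia_
    print(k, sse[k])
```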

As a reminder: do not fit the test data to the model; the test data is reserved for evaluating the model's performance.

Next we can visualize the images at the cluster centers:

```python
# Build the K-Means model
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
import numpy as np
from sklearn import cluster

digits = datasets.load_digits()
data = scale(digits.data)
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)

# Import matplotlib
import matplotlib.pyplot as plt

# Set the figure size
fig = plt.figure(figsize=(8, 3))

# Figure title
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

# For all target values (0-9)
for i in range(10):
    # Draw a subplot on a 2x5 grid
    ax = fig.add_subplot(2, 5, i + 1)
    # Display the image
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    # Turn the axis ticks off
    plt.axis('off')

# Show the plot
plt.show()
```

Next we predict the target values of the test data:

```python
# Predict the labels for `X_test`
y_pred = clf.predict(X_test)

# Print out the first 100 instances of `y_pred`
print(y_pred[:100])

# Print out the first 100 instances of `y_test`
print(y_test[:100])

# Study the shape of the cluster centers
print(clf.cluster_centers_.shape)
```

In the code above we predict the target values of the 450 test observations and store the result in y_pred, then print the first 100 elements of y_pred and y_test; a quick glance shows which observations were predicted correctly and which were not (bearing in mind that K-Means cluster numbers are arbitrary, so a cluster labeled 5 need not correspond to the digit 5).

Next let's visualize the predicted target values:

```python
# Build the K-Means model
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
import numpy as np
from sklearn import cluster
import matplotlib.pyplot as plt

# Import `Isomap()`
from sklearn.manifold import Isomap

digits = datasets.load_digits()
data = scale(digits.data)
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)

# Reduce `X_train` to two dimensions with Isomap
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Compute the cluster labels with K-Means
clusters = clf.fit_predict(X_train)

# Draw subplots on a 1x2 grid
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add the scatter plots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show the plot
plt.show()
```

This time we use Isomap() to reduce the dimensionality of the digits data; unlike principal component analysis, Isomap is a non-linear dimensionality reduction method.

Tip: rerun the program above with Principal Component Analysis instead, and see how it differs from Isomap:

```python
# Build the K-Means model
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import scale
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster

# Import `PCA()`
from sklearn.decomposition import PCA

digits = datasets.load_digits()
data = scale(digits.data)
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)

# Reduce the dimensionality of `X_train` with PCA
X_pca = PCA(n_components=2).fit_transform(X_train)

# Apply the K-Means algorithm
clusters = clf.fit_predict(X_train)

# Plot subplots on a 1x2 grid
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add the scatter plots
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show the plot
plt.show()
```

It looks like the K-Means algorithm may not perform very well here, but this still needs a more thorough evaluation.

Evaluating the performance of the clustering model

Evaluating a model's performance is an important topic in machine learning. Let's start by printing out the confusion matrix:

```python
# Import `metrics` from `sklearn`
from sklearn import metrics

# Print out the confusion matrix with `confusion_matrix()`
print(metrics.confusion_matrix(y_test, y_pred))
```

We can see that 41 of the 5s and 11 of the 8s were predicted correctly, but the confusion matrix is not the only way to evaluate the model; there are many more evaluation metrics to consider:
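Before moving on to those metrics, a note on reading the confusion matrix: row i counts observations whose true label is i, and column j counts how they were predicted, so the diagonal holds the correct predictions. A tiny hand-built sketch (toy 3-class labels, not the digits data):

```python
# Toy 3-class example: true labels vs predictions
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_hat  = [0, 0, 1, 1, 1, 2, 0, 2]

n_classes = 3
# confusion[i][j] = number of observations with true label i predicted as j
confusion = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_hat):
    confusion[t][p] += 1

for row in confusion:
    print(row)

# Diagonal entries are correct predictions; off-diagonal entries are errors
correct = sum(confusion[i][i] for i in range(n_classes))
print(correct, "of", len(y_true), "correct")  # 6 of 8 correct
```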
```python
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, adjusted_rand_score, adjusted_mutual_info_score, silhouette_score

print('% 9s' % 'inertia    homo   compl  v-meas     ARI AMI  silhouette')
print('%i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f'
      % (clf.inertia_,
         homogeneity_score(y_test, y_pred),
         completeness_score(y_test, y_pred),
         v_measure_score(y_test, y_pred),
         adjusted_rand_score(y_test, y_pred),
         adjusted_mutual_info_score(y_test, y_pred),
         silhouette_score(X_test, y_pred, metric='euclidean')))
```

These metrics include:

- Homogeneity score
- Completeness score
- V-measure score
- Adjusted Rand score (ARI)
- Adjusted Mutual Info score (AMI)
- Silhouette score

None of these scores is very good. For instance, the silhouette score is close to 0, which means that many observations lie near a cluster boundary and may have been assigned to the wrong cluster; the ARI tells us that the observations within a cluster are not entirely of the same kind; and the completeness score tells us that some observations were certainly assigned to the wrong cluster.
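One reason scores like ARI are used for clustering rather than plain accuracy is that they are invariant to how the clusters happen to be numbered. A pure-Python sketch of the (unadjusted) Rand index, using toy labels rather than the real digits output, shows this invariance:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two labelings agree
    (both put the pair in the same group, or both split it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]           # same grouping, different cluster IDs
ri_same = rand_index(y_true, y_pred)  # 1.0: perfect agreement despite renaming

y_bad = [0, 1, 0, 1, 0, 1]            # a genuinely different grouping
ri_bad = rand_index(y_true, y_bad)    # 0.4: much lower
print(ri_same, ri_bad)
```

scikit-learn's `adjusted_rand_score` additionally corrects this index for chance, which is why it can be negative for poor clusterings.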

This suggests that we should perhaps try another algorithm on the digits data.

Trying another algorithm: Support Vector Machines (SVM)

The clustering algorithms above apply when the training data has no target values; when the training data does come with target values, classification algorithms become applicable.
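The difference is visible in what the two families of algorithms consume: a clustering model is fit on `X` alone, while a classifier is fit on `X` together with `y`. As a minimal illustration of the supervised case, here is a sketch of a 1-D nearest-centroid classifier (toy numbers, not the digits data):

```python
# Toy 1-D training data with known target values
X_train = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
y_train = ['a', 'a', 'a', 'b', 'b', 'b']

# Supervised fit: the centroids are computed per *given* label
centroids = {}
for label in set(y_train):
    values = [x for x, y in zip(X_train, y_train) if y == label]
    centroids[label] = sum(values) / len(values)

def predict(x):
    # Assign the label whose centroid is closest
    return min(centroids, key=lambda label: abs(x - centroids[label]))

print(predict(1.1), predict(4.5))  # 'a' 'b'
```

K-Means also computes centroids, but has to invent the groups itself; a classifier is told the groups up front, which is why it can do so much better when labels exist.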

Let's take another look at the scikit-learn machine learning map. The first algorithm we see in the classification area is Linear SVC, so let's try it on the digits data:

```python
# Load the `digits` data
from sklearn import datasets
digits = datasets.load_digits()

# Import `train_test_split`
from sklearn.cross_validation import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(digits.data, digits.target, digits.images, test_size=0.25, random_state=42)

# Import the `svm` model
from sklearn import svm

# Create the SVC model
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')

# Fit the data to the SVC model
svc_model.fit(X_train, y_train)
```

In this program we set the `gamma` parameter manually, but a grid search or cross validation can find suitable parameter settings automatically. These methods are not the focus of this tutorial, so we will only briefly show how to tune the parameters with a grid search without going into depth:
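Cross validation, mentioned above as a way to pick parameters automatically, works by repeatedly holding out one fold of the training data and scoring on it. The fold-splitting idea can be sketched in pure Python (scikit-learn's cross-validation helpers do the real work, including shuffling and stratification):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds))   # 5 folds
print(folds[0])     # the first fold holds out samples 0 and 1

# Every sample appears in exactly one test fold
all_test = [i for _, test in folds for i in test]
print(sorted(all_test) == list(range(10)))  # True
```

A grid search simply runs this procedure once per candidate parameter setting and keeps the setting with the best average score.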
```python
from sklearn import datasets, svm
from sklearn.cross_validation import train_test_split
digits = datasets.load_digits()

# Split the `digits` data into two equal sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.5, random_state=0)

# Import GridSearchCV
from sklearn.grid_search import GridSearchCV

# Set the parameter candidates
parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Create a classifier with the parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates, n_jobs=-1)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Print out the results
print('Best score for training data:', clf.best_score_)
print('Best `C`:', clf.best_estimator_.C)
print('Best kernel:', clf.best_estimator_.kernel)
print('Best `gamma`:', clf.best_estimator_.gamma)
```

Next, we compare the classifier with manually set parameters against a classifier using the parameters found by the grid search, to see whether the grid-search parameters really are better:
```python
# Apply the classifier to the test data, and view the accuracy score
clf.score(X_test, y_test)

# Train and score a new classifier with the grid search parameters
svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(X_train, y_train).score(X_test, y_test)
```

With these parameter settings, the accuracy is as high as 99%! Before using the grid search, we specified the `kernel` parameter as `linear`; in the SVM algorithm `kernel` defaults to `rbf`, and besides `linear` it can also be set to `poly`.
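The `score` method of a classifier returns the mean accuracy, i.e. the fraction of test observations that were predicted correctly. A pure-Python sketch of that computation (with made-up labels, not the actual digits results):

```python
def accuracy(y_true, y_hat):
    # Mean accuracy: fraction of predictions that match the true labels
    matches = sum(t == p for t, p in zip(y_true, y_hat))
    return matches / len(y_true)

y_true = [3, 1, 4, 1, 5, 9, 2, 6]
y_hat  = [3, 1, 4, 1, 5, 9, 2, 0]   # one mistake out of eight
print(accuracy(y_true, y_hat))       # 0.875
```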

But what exactly is a kernel? A kernel is a function that computes the similarity between training observations, and the SVM algorithm uses it to perform the classification. We first assumed the observations were linearly separable and therefore set `kernel='linear'`, but the grid search results suggest using `kernel='rbf'` instead.
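To make the kernel idea concrete, here is a sketch in pure Python (with made-up 2-D points) of the two similarity functions involved: the linear kernel is just a dot product, while the RBF kernel turns squared Euclidean distance into a similarity in (0, 1]:

```python
import math

def linear_kernel(x, y):
    # Similarity as a dot product: large when the vectors point the same way
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.001):
    # Similarity decays with squared Euclidean distance; 1.0 when x == y
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, y))      # 2.0
print(rbf_kernel(x, x))         # 1.0: identical points are maximally similar
print(rbf_kernel(x, y) < 1.0)   # True: distinct points are less similar
```

Note that `gamma` appears inside the RBF formula, which is why tuning it mattered so much in the grid search above.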

Next, we use the `kernel='linear'` classifier to predict the test data:

```python
# Predict the labels of `X_test`
print(svc_model.predict(X_test))

# Print `y_test` to check the results
print(y_test)
```

Let's visualize the handwritten digit images together with the predicted results:

```python
# Apply the SVC algorithm
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn import cluster

digits = datasets.load_digits()
data = scale(digits.data)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(digits.data, digits.target, digits.images, test_size=0.25, random_state=42)

from sklearn import svm
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')
svc_model.fit(X_train, y_train)

# Import matplotlib
import matplotlib.pyplot as plt

# Assign the predicted values to `predicted`
predicted = svc_model.predict(X_test)

# Zip `images_test` and `predicted` into `images_and_predictions`
images_and_predictions = list(zip(images_test, predicted))

# Plot the first four elements
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    # Plot subplots on a 1x4 grid
    plt.subplot(1, 4, index + 1)
    # Turn off the axis ticks
    plt.axis('off')
    # Use a grayscale colormap
    plt.imshow(image, cmap=plt.cm.binary)
    # Add a title
    plt.title('Predicted: ' + str(prediction))

# Show the plot
plt.show()
```

This is very similar to the visualization we did during the exploratory analysis, except that this time we only show the first four test images with their predicted results.

So how does this model perform?

```python
# Import `metrics`
from sklearn import metrics

# Print the classification report of `y_test` and `predicted`
print(metrics.classification_report(y_test, predicted))

# Print the confusion matrix of `y_test` and `predicted`
print(metrics.confusion_matrix(y_test, predicted))
```

It is clear that the SVC performs far better than the earlier K-Means clustering. Next, we use `Isomap()` to visualize the predicted values against the target values:

```python
# Apply the SVC algorithm
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn import cluster
import matplotlib.pyplot as plt
from sklearn.manifold import Isomap

digits = datasets.load_digits()
data = scale(digits.data)

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(digits.data, digits.target, digits.images, test_size=0.25, random_state=42)

from sklearn import svm
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')
svc_model.fit(X_train, y_train)

# Reduce the dimensionality of the `digits` data
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)

# Apply the SVC algorithm
predicted = svc_model.predict(X_train)

# Plot subplots on a 1x2 grid
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.subplots_adjust(top=0.85)

# Add the scatter plots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=predicted)
ax[0].set_title('Predicted labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Labels')

# Add a title
fig.suptitle('Predicted versus actual labels', fontsize=14, fontweight='bold')

# Show the plot
plt.show()
```

Once done, we get a scatter plot showing a very good classification result, which is great news :)

What's next?

Image recognition

Congratulations on completing this tutorial! We demonstrated how to apply both supervised and unsupervised machine learning algorithms to the digits data. If you want to practice more machine learning for digit image recognition, don't miss the MNIST data, which you can download here.

Handling the MNIST data is very similar to what we did in this tutorial; in addition, you can refer to this page on how to apply the K-Means algorithm to the MNIST data.

If you have already practiced recognizing handwritten digit images with scikit-learn, you might consider the harder challenge of recognizing both letter and digit images. A famous dataset for this is Chars74K, which contains 74,000 images: besides the handwritten digits 0 to 9, it also includes uppercase A to Z and lowercase a to z. You can download it here.

Data visualization and the pandas package

This tutorial was mainly an introduction to Python and machine learning, but it is only the beginning of our journey with Python and data science. If you are interested in data visualization, see the Interactive Data Visualization with Bokeh course; if you are interested in working with data frames in Python, see the pandas Foundations course.

Posted in: Python, Machine Learning


