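PhD Notes 05

W01 Course Introduction

2023-09-12 09:00-12:00, Prof. 賴錦慧, chlai@cycu.edu.tw

Classmates: this morning's Data Science course has 4 Thai, 2 Indonesian, and 2 German students plus me, 9 enrolled in total. After hearing that a statistics background is recommended, though, perhaps half will drop out. Topics mentioned: regression and probability.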

Data Science Definition:
Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
Cloud computing: key enabler of data science
- Allows data science on a massive scale

Major Steps
-Identify problems
-Collect data
-Prepare data: clean and filter data
-Build model
-Evaluate model
-Communicate the results

Explaining the past <-> Predicting the future
-Classification: decision tree algorithm
-Clustering method: to tag different data
Clustering is used to identify groups of similar objects in datasets with two or more variable quantities. In practice, this data may be collected from marketing, biomedical, or geospatial databases, among many other places.

-R is a statistical language. (Azure, by Microsoft, is a paid service.)
-Define your problem first, then set the objective, and then find the solution.

Data science
Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
Main Objectives
- Explanation
- Prediction

Data science
Data science aims to deliver knowledge from big data, efficiently and intelligently. Data science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government.
A common epistemic requirement in assessing whether new knowledge is actionable for decision making is its predictive power, not just its ability to explain the past.

Data science Skillset

=Hacking Skills + Substantive Expertise + Math and Statistics Knowledge
=Machine Learning - Danger zone! - Traditional Research

- Data science, due to its interdisciplinary nature, requires an intersection of abilities: hacking skills, math and statistics knowledge, and substantive expertise in a field of science.
- Hacking skills are necessary for working with massive amounts of electronic data that must be acquired, cleaned, and manipulated.
- Math and statistics knowledge allows a data scientist to choose appropriate methods and tools in order to extract insight from data.
- Substantive expertise in a scientific field is crucial for generating motivating questions and hypotheses and for interpreting results.
- Traditional research lies at the intersection of math and statistics knowledge with substantive expertise in a scientific field.
- Machine learning stems from combining hacking skills with math and statistics knowledge, but does not require scientific motivation.
- Danger zone! Hacking skills combined with substantive scientific expertise, without rigorous methods, can beget incorrect analysis.

-2023/09/26: online class. December: a guest professor is scheduled to give a talk on AI & machine learning.

Book: Data Science for Business (already downloaded).

Data Science Applications:
AI and Data Science in Aviation Industry: 5 Real-life Use Cases
AI vs. Data Science: Differences in Technology and Use Cases
Machine learning software to learn in week 3:
Weka 3: Machine Learning Software in Java
Weka, a widely used data mining package, was developed in Java at the University of Waikato, New Zealand.

Reference Books:
Data Mining: Practical Machine Learning Tools and Techniques
Data Mining and Machine Learning: Fundamental Concepts and Algorithms
Data Mining: The Textbook, Charu C. Aggarwal
Recommender Systems: An Introduction
Mining of Massive Datasets, Cambridge University

Data science, due to its interdisciplinary nature, requires an intersection of abilities: hacking skills, math and statistics knowledge, and substantive expertise in a field of science.

Assignments:

Class Presentation
Choose one chapter from the book "Data Science for Business" for presentation
1.Prepare a PowerPoint file to present the details in the chapter in 30~40 minutes.
2.You can provide additional materials in the presentation
3.Share what you learn from the chapter
Discussion
- 5 minutes - Prepare 1 or 2 questions for the discussion

Presentations start in October.
DBA: Database Administrator
Roles: DBA - Data Analyst - Business Analyst - End User (who makes the decisions)

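Prof. 賴 has used the Yelp Open Dataset for analysis research: data pre-processing, data cleaning, data analysis. The dataset is in JSON format because it is too large and too complex. The review text is collected for analysis.

yelp.com is a website that provides user reviews and recommendations for all kinds of businesses and services, such as restaurants, dentists, bars, beauty salons, and doctors: https://www.yelp.com/. Users can search Yelp for businesses or services they are interested in and view other users' ratings, reviews, photos, and related information. Users can also register an account, post their own reviews and photos, and share experiences and advice with other users. Yelp is available not only in the United States but also in other countries and regions, such as Taiwan. It is a convenient, practical site that helps users discover and choose the businesses and services that suit them best.
Also, many datasets are available for download for research. Kaggle is worth joining; some people share their code there too. There is also the Amazon Sales Dataset; structured data like this is easier to analyze than text data.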

All attributes are clearly defined in advance = structured database
feature = attribute; record = row
In the cloud, MongoDB is used for big data
Function(1)
Characterization - finding the features is very important in data mining; this is called characterization
Discrimination - comparison: compare different groups to see which features distinguish them; this is called discrimination
Function(2)
Association and Correlation Analysis
- to find frequent itemsets
Algorithm used: association rules
Function(3)

Function(4)
Cluster Analysis
The well-known k-means algorithm
Function(5)
Outlier Analysis
Interesting Patterns = Knowledge

? Explain what objective measures and subjective measures are
? Train model
Bing? Briefly introduce theory-guided data science

Bing? Give examples of where MongoDB is a good fit
Classification and label prediction
Will cover decision trees, naive Bayes, neural networks,

W02 Introduction to data mining and data science

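2023-09-19 (Tue) 09:00-12:00, Prof. 賴錦慧: Introduction to data science and data mining

Reference: CHAPTER 01_INTRODUCTION (should study)

Ch1 chapter exercises
1.1 What is data mining? In your answer, address the following:
(a) Is it another hype?
(b) Is it a simple transformation or application of technology developed from databases, statistics, machine learning, and pattern recognition?
(c) We have presented a view that data mining is the result of the evolution of database technology. Do you think that data mining is also the result of the evolution of machine learning research? Can you support this view based on the historical progress of that discipline? Address the same for statistics and pattern recognition.
(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.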

answer
answer

1.2 How is a data warehouse different from a database? How are they similar?

answer
answer

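1.3 Define each of the following data mining functionalities: characterization, discrimination, association and correlation analysis, classification, regression, clustering, and outlier analysis. Give an application example of each functionality, using a real-life database that you are familiar with.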

answer
answer

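1.4 Present an example where data mining is crucial to the success of a business. What data mining functionalities does this business need (e.g., think of the kinds of patterns that could be mined)? Alternatively, can such patterns be obtained by simple data query processing or statistical analysis?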

answer
answer

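1.5 Compare the similarities and differences between the following functionalities: discrimination and classification, characterization and clustering, and ...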

answer
answer

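1.6 Based on your observations, describe another possible kind of knowledge that needs to be discovered by data mining methods but is not listed in this chapter. Does it require a mining methodology quite different from those outlined in this chapter?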

answer
answer

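1.7 Outliers are often discarded as noise. However, one person's garbage could be another's treasure. For example, exceptions in credit card transactions can help us detect credit card fraud. Using fraud detection as an example, propose two methods that can be used to detect outliers and discuss which one is more reliable.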

answer
answer

1.8 Describe three challenges to data mining regarding mining methodology and user interaction.

answer
answer

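1.9 What are the major challenges of mining a huge amount of data (e.g., billions of tuples) in comparison with mining a small amount of data (e.g., a database of a few hundred tuples)?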

answer
answer

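1.10 Outline the major challenges of data mining in one specific application domain, such as stream/sensor data analysis, spatiotemporal data analysis, or bioinformatics.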

answer
answer

W03 Understanding the data

2023-09-26 (Tue) 09:00-12:00, Prof. 賴錦慧: Understanding the data

Content: CH02 - Getting to know your data. Slides 25 and 31 contain the Exercise #1 questions.
Slide 73 shows how to write out the answer for Minkowski distance.
The professor is attending a conference in Japan, so this session is held online.
The assignment: pages 25 and 31 of the Chapter 2 slides. Please list the calculation process for each question. You need to upload your file to i-learning before Oct. 2.

Assignment 1:
Slide25:
Measuring the Central Tendency-Example
The values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Mean salary= ?
Median = ?
Mode = ?
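
A minimal sketch for checking the three answers with Python's standard `statistics` module. The variable names are my own; the salary list is the one given on the slide. Note that the data has two values tied for most frequent, so `multimode` is used instead of `mode`.

```python
from statistics import mean, median, multimode

# Salary values (in thousands of dollars) from Slide 25, already sorted.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

salary_mean = mean(salaries)         # sum 696 divided by 12 values
salary_median = median(salaries)     # average of the 6th and 7th values (52 and 56)
salary_modes = multimode(salaries)   # every value tied for the highest frequency

print("mean:", salary_mean)
print("median:", salary_median)
print("modes:", salary_modes)
```

The data is bimodal (52 and 70 both appear twice), which is worth stating explicitly in the written answer.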

Assignment 2:
Slide31:
Example of boxplot
The math scores of 12 students:
59, 60, 64, 65, 67, 70, 72, 74, 75, 76, 78, 80
1) Please show the five number summary:
Min= ?
Q1=?
Median=?
Q3=?
Max=?
IQR=?
2) Draw a boxplot
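
A hedged sketch of the five-number summary for the 12 scores. It assumes the "median of each half" convention for Q1 and Q3; other quartile conventions (for example the interpolation used by spreadsheet software) can give slightly different values, so check which one the slides use. The function name and the 1.5 x IQR fences are my own additions for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # values below the median position
    upper = data[half + (n % 2):]    # values above it (skip the middle one if n is odd)
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }

scores = [59, 60, 64, 65, 67, 70, 72, 74, 75, 76, 78, 80]
summary = five_number_summary(scores)
iqr = summary["q3"] - summary["q1"]

# Tukey-style fences: points outside them are usually drawn as outliers.
lower_fence = summary["q1"] - 1.5 * iqr
upper_fence = summary["q3"] + 1.5 * iqr

print(summary, "IQR:", iqr, "fences:", (lower_fence, upper_fence))
```

Under this convention the whiskers extend to the most extreme data points that still lie inside the fences, not to the fences themselves; here no score falls outside them, so the whiskers reach the minimum and maximum.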

my assignment and a letter to Prof.Lai

Oct. 2, 2023

Dear Professor Lai,
Since I have come to Toronto, Canada, to visit my mother, I had to take leave from class on Tuesday.
I would like to know if it is possible to watch the video of Tuesday's class afterwards. Or is it exactly the same as Chapter 02 Getting to know your data (part 2)?
As for last week's assignment, I have a question related to outliers: do the 'whiskers' mark the lower and upper bounds?
A box plot is also called a box-and-whisker plot.

(See phd10-W08-p.50/2- for a worked example of computing the interquartile range.)

Thank you for your teaching.
Warm regards
Hong, Tse Wen - 11204604 tsewen.hong@gmail.com

Presentation schedule: 2023-11-14 (Tue), 洪哲文 presents Chapter 04, Fitting a Model to Data, from the textbook "Data Science for Business".
The book is here: Data Science for Business

The requirements for the presentation
1.Please prepare a PowerPoint file to present the details in the chapter in 30 minutes and 10 minutes for Q&A.
2.You can provide additional materials (ex: figures, examples, video, etc.) in the presentation.
3.Share your comments about the chapter.
4.After the presentation, every student needs to upload the PowerPoint file to the i-Learning (Assignments).

W04 Data preprocessing methods

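2023-10-03 (Tue) 09:00-12:00, Prof. 賴錦慧: Data preprocessing methods
Content: Ch03 - Data preprocessing. Slide 30 contains the Exercise #4 question.

Zhihu: Big data and statistics

Zhihu: How to do descriptive statistical analysis in Excel?

On leave, visiting family in Toronto, but the assignment was submitted.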

W05 Taiwan National Day (Holiday)

2023-10-10 (Tue) 09:00-12:00, Prof. 賴錦慧: Association analysis

National Day holiday; no class.

W06 Association analysis

2023-10-17 (Tue) 09:00-12:00, Prof. 賴錦慧: Association analysis
Continue: Ch03 - Data preprocessing from Slide-
S17: the data from 3 files are combined by primary key

Notices
10/31 will be an online class (Data Science)
11/07 midterm exam (written test)
Prepare your presentation! Before presenting, you need to upload your ppt file.

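Note S22: tabulate the observed counts (with the expected counts in parentheses), compute the statistic, then look it up in the table
The larger χ² is, the more strongly the two attributes are related
Look up the chi-square table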
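
A minimal sketch of the chi-square computation described above, in plain Python. The 2x2 counts are a hypothetical example of my own, not the numbers from S22; the function name is also mine. Expected counts come from (row total x column total) / grand total.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical observed counts for two binary attributes.
table = [[250, 200],
         [50, 1000]]
stat = chi_square(table)
dof = (len(table) - 1) * (len(table[0]) - 1)  # degrees of freedom for the table lookup

print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

With 1 degree of freedom, the critical value at the 0.001 significance level is 10.828, so a statistic far above that rejects the hypothesis that the two attributes are independent.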
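Slides 23-24 are very important: positive correlation / no correlation / negative correlation (Pearson correlation). Finding the relationship matters a lot.
covariance
correlation
For what "chi-square test" usually refers to, see Pearson's chi-square test
Statistics How To: a Pearson test only requires the correlation calculation;

Assignment: Slide 30, do the covariance exercise.
Note: while in Canada I missed the Ch2 part 2 lecture and assignment; it must be handed in before 10/24
Installed the Weka data mining software today.
Check: java -version # Java should be version 8 or later
Check: ls -l /usr/share/java/weka.jar # confirm the installation
Launch: java -jar /usr/share/java/weka.jar # graphical user interface
1. Tutorial: 【資料探勘】UNIT 3：WEKA介紹與基本操作 (Weka introduction and basic operation)
2. Using a Weka classification model to predict unseen cases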

W07 Association analysis

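2023-10-24 (Tue) 09:00-12:00, Prof. 賴錦慧
Continue: Ch03 - Data preprocessing. Started from Slide 31 (Data Reduction) and finished through Slide 82; the assignment is Chapter_3

Assignment 2 (Chapter 02): submitted; the calculations are in the linked Google Doc

Assignment 3 (Chapter 03): submitted. See the assignment for the calculation steps.

Lecture notes
S.41: correlation coefficient (for numerical attributes); χ² (chi-square) test (for categorical attributes)
S.42: GPA (Grade Point Average): the average grade points across courses.
S.44 (very important): heuristic search in attribute selection. It is like the thunder god deciding which person to strike. A decision tree is like how we rank our salespeople.

S.47: Parametric statistics is a branch of statistics that assumes the sample data come from a population that can be adequately modeled by a probability distribution with a fixed set of parameters.

S.50. Regression Analysis (find out Best model=best line) Multivariable regression models are used to establish the relationship between a dependent variable (i.e. an outcome of interest) and more than 1 independent variable. Multivariable regression can be used for a variety of different purposes in research studies.

S.53: Prof. 范's Assignment #4 was exactly this histogram analysis (the bars should be separated by gaps)
S.55: Clustering (covered in more depth in Chapter 10)
S.65: aggregation; discretization can be applied to both numerical and categorical data
S.66 (very important): normalization methods; you can choose any one of them to do normalization
S.68: using a histogram, like the right-hand figure on S.53
S.71: smoothing was already discussed last week (explained once more: bin boundaries are there for smoothing)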

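Linear regression: the error is most commonly measured (computed) with the
least squares method. It is a mathematical optimization method that finds the best-fitting function for the data by minimizing the sum of squared errors. With it, unknown values can be estimated simply, such that the sum of squared errors between the estimates and the actual data is minimal. Least squares is the standard regression approach for finding an approximate solution to an overdetermined system, i.e. a set of equations with more equations than unknowns. Its most important application is curve fitting. The method is usually credited to Carl Friedrich Gauss (1795), but it was first published by Adrien-Marie Legendre.
See the following links:

Least squares method

MBA智庫百科 (MBA Lib): What is the least squares method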

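Several normalisation methods
Normalisation is a data preprocessing method that converts data to a common standard scale, removing differences in offset and spread between features so that data quality and comparability improve. In deep learning, normalisation can also speed up model convergence, reduce the risk of vanishing or exploding gradients, and improve generalization. Commonly used normalisation methods include:

(BN) Batch Normalisation: normalises the inputs to each layer of a neural network. It computes the mean and variance of each channel over each batch and rescales the inputs to mean 0 and variance 1. This reduces internal covariate shift, i.e. the change in each layer's input distribution, which speeds up convergence and allows a larger learning rate. BN suits fixed-depth feed-forward networks such as CNNs.

(LN) Layer Normalisation: normalises within each layer, one sample at a time. It computes the mean and variance over the channels, height, and width of each individual sample and rescales to mean 0 and variance 1. This reduces variation in the distributions between layers and improves model stability. LN does not depend on the batch size or the input sequence length, so it suits RNNs.

(IN) Instance Normalisation: normalises image pixels. It computes the mean and variance of each channel within each sample and rescales the pixels to mean 0 and variance 1. This reduces variation in pixel distributions between images and improves image-processing models. IN was first used for image style transfer.

(GN) Group Normalisation: normalises the inputs to each layer by dividing the channels into groups and computing, for each sample, the mean and variance over each group's channels, height, and width, then rescaling to mean 0 and variance 1. This improves performance at small batch sizes. GN suits memory-heavy tasks such as image segmentation.
See the following links:

Common normalization methods: BN, LN, IN, GN (with code and links) - Tencent Cloud

Various kinds of normalization - Zhihu

[Deep learning] Common normalization methods and a summary - CSDN blog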

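Min-max normalization is a simple method: for example, to compare the monthly salaries of 10 people, it maps each salary to a number between 0 and 1.
v' = (v - min) / (max - min) x (new_max - new_min) + new_min; here new_max = 1 and new_min = 0 are the target range you choose (mapping onto a new range for inspection)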
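
The formula above can be sketched as a small function. This is an illustrative version only: the function name, the default target range, and the four salary figures are my own assumptions, not data from the class.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map values linearly so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max normalization is undefined")
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

# Hypothetical monthly salaries.
salaries = [30000, 45000, 60000, 90000]
scaled = min_max_normalize(salaries)

print(scaled)
```

One design point to keep in mind: a single extreme salary stretches the range and squeezes all other values toward 0, which is why z-score normalization is sometimes preferred when outliers are present.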

Discretization: binning is one kind; in short, it splits a set of numeric values into labeled intervals. Two main approaches: classification-based (e.g., decision tree analysis) and correlation-based (e.g., ChiMerge, χ²-based discretization)

The least squares method is a mathematical optimization modelling method. It finds the best-fitting function for the data by minimizing the sum of squared errors.
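
As a hedged illustration of least squares for a straight line y = a*x + b, the closed-form solution is a = S_xy / S_xx and b = mean(y) - a * mean(x). The function names and the five data points below are hypothetical and only meant to show the mechanics.

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """The quantity that least squares minimizes."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points lying close to y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
sse = sum_squared_error(xs, ys, slope, intercept)
print(f"y = {slope:.2f}x + {intercept:.2f}, SSE = {sse:.3f}")
```

Any other line through the same points gives a larger sum of squared errors, which is what "best line" means in this sense.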

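W08 Midterm exam next week

2023-10-31 (Tue) 09:00-12:00, Prof. 賴錦慧

The professor was unavailable, so the class moved online.

It seems Chapters 4 and 5 were skipped and we started Chapter 6, MINING FREQUENT PATTERNS, ASSOCIATION AND CORRELATIONS.
Slide 26 has Exercise #1 and Slide 27 has #2. Both must be uploaded as an assignment.

Reference: Chinese translation of the textbook, Chapter 6 (advanced pattern mining), for self-study.
The book comes with a CD; Chapters 6, 8, and 10 are available as PDF files.

Reference: Chinese translation of the textbook, Chapter 8 (classification, advanced methods), for self-study.

Reference: Chinese translation of the textbook, Chapter 10 (advanced cluster analysis), for self-study.

Due before 2023/11/06; already submitted. Assignment written in Google Doc | assignment PDF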

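W09 Mid-term week

2023-11-07 (Tue) 09:00-12:00, Prof. 賴錦慧
Reference: subjects to study
DataScienceForBusiness Chpt4 Fitting a Model to Data, PDF file. The [html file] can be read aloud. | Google Doc slides | spreadsheet reading log |

Midterm exam! (Afterwards, a letter to the professor)

Dear Professor,
Returning to school after 50 years, learning new things and sitting a midterm exam, is truly exciting and stimulating.
Answering an exam in English was a new experience for me. A few things came to mind afterwards:
1. In the first question I meant "make better predictions and decisions" but wrote "judgement".
2. I had not used a calculator for a long time. I found one at home last night, and only this morning realized that a basic calculator can only take square roots, by which time it was too late to swap it.
3. Normalization is very basic and I thought I already knew it, so over the past few days it never occurred to me to review it (test myself); the answer only came to me while driving. I also realized that I have been reading the Chinese translation of the book alongside the professor's English ppt; there are many terms, and for quite a few I could not connect the two versions on the spot. I need to pay more attention to this.
The main reason I am writing is to say:
I am glad to have an exam like this, because it gives me the chance to see more clearly where I fall short or am not familiar enough. Writing excuses on the exam paper was really unnecessary, and my wife laughed at me for it (I showed her a photo), so please grade by the normal, strict standard. I wrote those odds and ends and these post-exam reflections only to say that I have gained a great deal from your class. Thank you again.

Respectfully, 洪哲文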

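Preparing for next week's presentation

Desmos is a free online math tool; besides Desmos, GeoGebra and Plotly are also free online tools. They let you enter data or functions on a web page and plot the corresponding graphs, including vectors, linear regression, curves, and so on.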

ChatPDF suggest outline of presentation about Chapter#4:
Introduction to Data Mining Techniques for Fitting Models to Data
1. Definition of data mining and its importance in modern data-driven industries
2. Overview of the fundamental concepts of fitting a model to data
3. Explanation of objective functions and loss functions in data mining
4. Introduction to linear regression and its use in predictive modeling
5. Overview of support-vector machines and their common applications
6. Discussion of the importance of choosing the right model and parameters for a given dataset
7. Conclusion and summary of key takeaways from the article
Note: Depending on the specific content of the article, you may need to adjust the outline accordingly. Additionally, 30 pages is a relatively long article, so you may want to consider breaking it up into multiple slides or sections.

Reference: a recorded presentation on YouTube, "Model Fitting and Experimental Modeling Part 1: Introduction". I could also record myself first and watch it before presenting.

W10 Classification and prediction models

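2023-11-14 (Tue) 09:00-12:00, Prof. 賴錦慧: Classification and prediction

Presentation : Data science for business
CHAPTER 4 : Fitting a Model to Data (Presentation)

1. Midterm score: 44. The papers were returned today; the professor walked through the solutions and reviewed the mistakes. Midterm paper
2. Lecture on Chapter 6, Slide 28: FP-Growth approach (slides 28 to 50), introducing the Frequent Pattern-Growth algorithm.
At slide 49: redo Example 2 until I can do it myself.
3. In the third period I give my report: 洪哲文's presentation (Google Doc).

While preparing for the midterm, I looked back at one slide from Ch. 3 and found it useful: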
Chp3-S21:Basic Statistical Descriptions of Data
Motivation
- To better understand the data: central tendency, variation and spread
Data dispersion characteristics
- median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
- Data dispersion: analyzed with multiple granularities of precision
- Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
- Folding measures into numerical dimensions
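- Boxplot or quantile analysis on the transformed cube

The "Task-Technology Fit" (TTF) model is a theoretical model from the information systems field. It is used to assess how, and to what extent, a technology (such as a data mining tool or software package) fits or supports a specific task and its users' needs. Its core idea is that a technology performs better when its functional characteristics match the requirements of the task it is meant to accomplish.

In the context of data mining, the TTF model can be applied to study how different data mining techniques and tools fit specific data analysis tasks. Data mining usually involves extracting useful information and patterns from large amounts of data, covering tasks such as classification, clustering, association rule learning, regression, and anomaly detection. According to the TTF model, choosing a suitable data mining tool should be based on the following aspects:

Task characteristics: the concrete requirements of the analysis, such as prediction, classification, or identifying anomalous patterns.
Technology characteristics: the capabilities of the available tools, such as algorithm complexity, customizability, and the ability to handle large datasets.
Users and environment: the users' skills and experience, plus environmental constraints (such as time, budget, and data security).

For example, if a company wants to predict sales trends from customer data, then according to the TTF model it should choose a strong predictive analytics tool that can both handle large volumes of data and offer advanced predictive modelling techniques such as machine learning algorithms. The tool should also be user-friendly, so that the company's analysts can easily configure the models and interpret the results.

When applying the TTF model, researchers or practitioners typically use questionnaires, interviews, and case studies to assess how well a given technology matches a given task, and propose improvements accordingly.

Overall, the TTF model stresses the importance of matching task requirements with technical capabilities when selecting and applying data mining or other information technology tools, in order to improve work performance and user satisfaction.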

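Lecture notes:
Discussing the algorithm
1. First analyze and build the tree structure
Divide-and-Conquer: 'cut the problem or goal into small pieces and solve them step by step'
A tree starts from the root; you need to know what a subtree is, ...
Know the definitions first, then just follow the logic.
s.33-
1. List all the items, then find their frequencies.
2. Set a threshold, e.g. min_support=3; items with a count below 3 are discarded.
3. Reorder the frequent items and build the Header Table (descending frequency).
Derive the F-list from the table.
4. Rescan the DB and build the tree (starting from the first transaction and its first frequent item).
Each time the item is found on the current path, add 1 to its count; if the item is new, create a node for it with count 1;
5. Then process the 2nd transaction.
See the YouTube introduction to the FP-Growth algorithm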
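
The five steps above (count items, apply min_support, order by descending frequency, rescan and insert each transaction into the tree) can be sketched as follows. This only builds the FP-tree; the mining of conditional pattern bases is left out. The class, function, and the five toy transactions are my own assumptions, and ties in frequency are broken alphabetically here, which is an arbitrary choice.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and child nodes keyed by item."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Steps 1-2: count each item once per transaction, keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_support}
    # Step 3: F-list in descending support (alphabetical on ties).
    f_list = sorted(frequent, key=lambda item: (-frequent[item], item))
    rank = {item: r for r, item in enumerate(f_list)}
    # Steps 4-5: rescan, insert each ordered transaction along a shared prefix path.
    root = Node()
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.__getitem__)
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item)  # new branch
            child.count += 1                               # shared prefix: count + 1
            node = child
    return f_list, root

# Toy database: each string is one transaction, each letter one item.
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
f_list, root = build_fp_tree(db, min_support=3)
print("F-list:", f_list)
print("root children:", {k: v.count for k, v in root.children.items()})
```

Because transactions are inserted in F-list order, frequent items share prefixes near the root, which is what keeps the tree compact.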

After a classmate presented Chapter 3, the professor asked:
- Do you know entropy?
- What does high or low entropy represent? What does it mean?
- How about IG (information gain)? If it is high, is that good or not?
- This chapter mentions supervised classification; this is important.

- must allow misclassification
- Choosing a threshold that allows misclassification is an example of the Bias/Variance Tradeoff that plagues all of machine learning.

W11 Mining frequent patterns

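2023-11-21 (Tue) 09:00-12:00, Prof. 賴錦慧: Cluster analysis methods
1. Lecture on Chapter 6, FP-Growth approach (today from slide 51 to 70), introducing the Frequent Pattern-Growth algorithm.

FP-Growth Algorithm
Final discussion, S.71: lift. Note: it measures the correlation of 2 items; see S.71
You can first check IR(A,B), the imbalance ratio, alongside the null-invariant measures, then Kulc

2. My turn to present today: 洪哲文's presentation, Chapter 4: Fitting a Model to Data (Google Doc).
The text is here as [PDF]; there is also an [html] version.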

W12 Classification and prediction models

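2023-11-28 (Tue) 09:00-12:00, Prof. 賴錦慧: Cluster analysis methods
1. Pattern Evaluation Methods are not covered today (study them on my own); we start

CHAPTER 08 CLASSIFICATION: BASIC CONCEPTS.
(Reference: Chinese translation of the data mining textbook, p7-1, Classification: basic concepts)

Slides.:
08: Training set (70-80% of the data) for training; test set (30-20% of the data)
09: Validation set: [ Training set | Test set ]
[ Test set ] > Validation: if validation is done, this portion is split into a validation set and a testing set
[ Test set ] > Testing: a decision tree does not need a validation set; deep learning generally does.

11.: Because the rules are learned by training, they may sometimes be wrong; but once training is done, unseen data can be fed in
12: The yellow nodes represent attributes; the attributes are ranked by importance (and this becomes the model)
16. Then apply the model to the test data
19. The resulting trees may also look different
21. Finding who will buy a computer: how to decide the most important attribute (node)
22. Because it picks the best attribute at each step and repeats, it is called a greedy algorithm. Each record (tuple) is judged following the attributes in order of importance
entropy and information gain are methods that help decide importance
23. 'majority voting'
24. Need to know how to handle these four types
25. Nominal attributes: with many values, find a way to group them into just two sets
26. Ordinal attributes: likewise go from a multi-way split to a binary split, but the order must be preserved
27. Continuous attributes: sometimes a multi-way split is also used
28. Binning by equal width or equal frequency, etc., can be used to split
29. Which split is the most important? In the exercise the classes C0 and C1 start with 10 records each. Note that Gender should be M or F, not yes/no.
Method: compute the numbers for the three attributes;
Car type separates the classes most strongly, so it is the best and most important; gender is too even, too close; ID is not informative;
distinguishable = majority voting
30. distinguishable = purity (when the two are too similar, with values too close, that is impurity)
31. Two formulas, entropy or the Gini index, can compute impurity (depending on which TI you use)
32. Find the impurity before (P) and after (M) the split, then compute Gain = P - M; a split with a high Gain is a good split
33. In the example, two ways of splitting give different impurities M1 and M2; compare P - M1 with P - M2
34. Attribute Selection in Decision Tree
35. The three TIs use different measures
36. ID3 uses information gain; D is the total number of records; compute the entropy first, then the value after splitting
37. The example makes it clearer: first compute the probabilities of Yes and No, then plug them into the entropy formula (important formula). There are 4 attributes; see how age is computed: p1 (2 yes), n1 (3 no); draw it out like this to compute the entropy; all three must be computed.
Each attribute gives an Info result M; finally use P - M to get Gain(attribute);
All 4 attributes must be computed. Age has the highest value while the other three are about the same, so age is the most important and is used for the split (information gain can only be computed once the entropy is known)
38. This yields three information gains; then repeat recursively downward, which is why it is called greedy
39. Next, continuous attributes: Computing Information-Gain for Continuous-Valued Attributes. Because the values are continuous, pick a point at which to cut; for example at A, which gives two groups D1 and D2;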
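
A small sketch of the entropy and Gain = P - M calculation from slides 32 to 37. The notes only record the first age group (2 yes, 3 no); the overall class counts (9 yes, 5 no) and the other two age groups below follow the standard "buys computer" textbook example and are my assumption about what was on the slide.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as counts, e.g. (yes, no)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    """Gain = P - M: entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    after = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent_counts) - after

parent = (9, 5)                      # assumed: 9 yes, 5 no over 14 records
age_groups = [(2, 3),                # first group: 2 yes, 3 no (from the notes)
              (4, 0),                # assumed: a pure group
              (3, 2)]                # assumed
gain_age = information_gain(parent, age_groups)

print(f"Info(D) = {entropy(parent):.3f}, Gain(age) = {gain_age:.3f}")
```

The same two functions apply to every other attribute; the attribute with the largest gain becomes the split, and the procedure repeats on each branch.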

The professor said: "Today, you have no assignment." I said: "THANK YOU VERY MUCH." Everyone laughed.

2. Next, the German student presented Chapter 5. Excellent!

W13 CLASSIFICATION:Gain Ratio (C4.5) & Gini Index (CART) etc.

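2023-12-05 (Tue) 09:00-12:00, Prof. 賴錦慧: Text mining
1.(1) A Vietnamese classmate presented Chapter 6 and used a very good figure: 17 types of similarity and dissimilarity measures used in data science. I asked him about it and got this link.
(2). ⏰ About the final report | see instructions | reference sample format | This is only a proposal, not a thesis, so there is no need to explain the methodology step by step. It is a good opportunity to practice, though. You can start from a domain you are familiar with, because from next year you will begin preparing your thesis. 洪哲文 presents on 2024/01/02, 20 minutes. Use APA format; see the APA format reference.
(3) Then another classmate (Thai, I think), Naprt, presented Chapter 7:

2. Today continues CHAPTER 08 CLASSIFICATION: BASIC CONCEPTS, starting from Slide 40.
S40: Look back at S35. You need to know the information gain formula; next comes Gain Ratio (C4.5). This slide introduces how to compute it. Using the Information Gain numbers from S37, pick the attribute Income and count; first build a table like the one below: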
income   Yes  No  I(Yes, No)
low      3    1
medium   4    2
high     2    2

Then plug the counts into the formulas. There are 14 records, so once Gain = 0.029 is computed, GainRatio = 0.019 follows.
The attribute with the maximum gain ratio is selected as the splitting attribute.
Keep practicing how to compute Information Gain
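
To check the two numbers quoted for income (Gain 0.029, GainRatio 0.019), here is a minimal sketch using the counts from the table (low 3/1, medium 4/2, high 2/2; 9 yes and 5 no overall). GainRatio divides the gain by SplitInfo, the entropy of the partition sizes themselves. Function and variable names are my own.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (yes, no) counts per income value, taken from the table in the notes.
income = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}

total_yes = sum(yes for yes, _ in income.values())
total_no = sum(no for _, no in income.values())
n = total_yes + total_no

info_before = entropy((total_yes, total_no))
info_after = sum(sum(part) / n * entropy(part) for part in income.values())
gain = info_before - info_after

# SplitInfo only looks at how many records fall into each branch.
split_info = entropy([sum(part) for part in income.values()])
gain_ratio = gain / split_info

print(f"Gain(income) = {gain:.3f}")
print(f"SplitInfo(income) = {split_info:.3f}")
print(f"GainRatio(income) = {gain_ratio:.3f}")
```

Dividing by SplitInfo penalizes attributes that split the data into many small branches, which plain information gain tends to favor.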
S41: Gini Index (CART)
The Gini index represents impurity, so the attribute that provides the smallest Gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
The 7 here is the number of tuples where income is in {low, medium} and buys computer is Yes.
S42: compute the Gini index for all three groupings, then split on the one with the smallest value
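
A sketch of the S41-S42 procedure: enumerate the three binary groupings of income, compute the weighted Gini index of each, and pick the smallest. The (yes, no) counts reuse the income table from S40; the helper names are my own.

```python
def gini(counts):
    """Gini impurity of a class distribution given as counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini impurity after splitting into the given partitions."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# (yes, no) counts per income value, from the table in the notes.
income = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}

def merge(values):
    """Add up the (yes, no) counts of several income values."""
    return tuple(sum(income[v][i] for v in values) for i in range(2))

candidates = {}
for left in (("low", "medium"), ("low", "high"), ("medium", "high")):
    right = tuple(v for v in income if v not in left)
    candidates[left] = gini_split([merge(left), merge(right)])

best = min(candidates, key=candidates.get)
for left, value in candidates.items():
    print(left, f"{value:.3f}")
print("best binary split:", best)
```

The grouping {low, medium} contains 7 "yes" and 3 "no" tuples, which is where the 7 mentioned above comes from.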
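S43: Gini Index: points to note
S44: skipped
S45: Other Attribute Selection Measure
S46: Decision trees are done; next is Model Overfitting
S47: if overfitting, then: poor accuracy for unseen samples
S48: there are 3 kinds of error: training, test, and generalization errors.
S49: Look at this example: the blue points are 5,400 samples. This is odd because the training/testing ratio is reversed to the extreme.
S50: As the number of nodes in the decision tree increases, the error decreases.
S51: Fewer errors looks better, but if that were the whole story, would zero error be best in the end? Not necessarily.
S52:
S53:
S54: The blue line is overfitting and the red line is the testing error; see the explanation of underfitting and overfitting for both
S55: Adding more training data is one remedy
S56: Possible causes of overfitting: not enough training data (usually 70-80% should be training data), or a model that is too complex.
S57: Tree pruning: trim or cut off some branches!
S58: An example: different algorithms prune different nodes
S59: skipped
S60: Classification in Large Databases: decision trees are simple, handy, and easy to understand, so they are widely used, even on large databases
S61:
S62:
S63:
S64: Bayes Classification Methods, which are based on conditional probability (reference: conditional probability and Bayes' theorem, high-school mathematics under the 108 curriculum)
S65: Bayesian Classification: Why?
S66: Review of Bayes' Theorem: Basics, and the formula
H is the hypothesis and X is our data sample
Since H cannot be known directly, an indirect method is used to compute the probability of this guess.
S67: Further explanation of Bayes' Theorem
S68:
S69: Naïve Bayes Classifier: practice this again on my own.
S70: Learning from the sample data to predict the class of an unknown X
S71: The calculation is as follows (redo it at home: work through it in Chinese, compute it myself, then do it once more in English). The goal is Yes 0.028 versus No 0.007; whichever value is higher is the class assigned to X.
S72: Avoiding the Zero-Probability Problem: be careful to avoid the zero-probability error. The fix (Laplacian correction) is to add 1 to every count, so a count of 0 becomes 1 and no probability is exactly 0.
S73: Naïve Bayes Classifier: Comments. This leads to the question: how to deal with these dependencies? Bayesian Belief Networks (Chapter 9)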
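
A hedged sketch of the S71 calculation and the S72 correction. The notes only record the final scores (Yes 0.028, No 0.007); the per-attribute counts below are the standard textbook "buys computer" numbers for X = (age <= 30, income = medium, student = yes, credit = fair) and are my assumption about what the slide used. All names are my own.

```python
from math import prod

class_counts = {"yes": 9, "no": 5}

# For each class: how many training tuples match X on
# (age, income, student, credit_rating). Assumed textbook counts.
match_counts = {"yes": [2, 4, 6, 6], "no": [3, 2, 1, 2]}

# Number of distinct values per attribute, needed for the Laplacian correction.
n_values = [3, 3, 2, 2]

def score(label, laplace=False):
    """P(X | class) * P(class) under the naive independence assumption."""
    n_class = class_counts[label]
    prior = n_class / sum(class_counts.values())
    if laplace:
        # Add 1 to each count (and the number of values to the denominator).
        likelihoods = [(c + 1) / (n_class + k)
                       for c, k in zip(match_counts[label], n_values)]
    else:
        likelihoods = [c / n_class for c in match_counts[label]]
    return prior * prod(likelihoods)

scores = {label: score(label) for label in class_counts}
prediction = max(scores, key=scores.get)
smoothed = {label: score(label, laplace=True) for label in class_counts}

print({k: round(v, 3) for k, v in scores.items()}, "->", prediction)
print({k: round(v, 3) for k, v in smoothed.items()})
```

Without the correction, a single zero count would make the whole product zero no matter how strongly the other attributes point to that class.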

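Next class starts from here; no assignment this week!
S74: Rule-based Classification

I already installed Weka on 2023/10/17 (see the Logseq journal); launch command: $ java -jar /usr/share/java/weka.jar (graphical user interface).
Tutorial video: Weka Tutorial |
I have also started tidying up company transaction data for a trial run, since the final assignment requires an (application plan) report. Before running Weka, the data must first be prepared as a .csv file (hgpt_1.php).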

W14 Rule-based Classification

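2023-12-12 (Tue) 09:00-12:00, Prof. 賴錦慧: Text mining
1.(1) I compiled the Decision Tree Algorithm material from the past two weeks into a comparison table (see the DTalgorithm worksheet).
(2). Today CHAPTER 08 starts from S74, Rule-based Classification.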

S74:
S75:
S76: prevalence
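S77: Rules are easiest to extract from a DT
S78: Once you have rules (IF-THEN, as above), you can do the classification calculation
S79: There are other ways to induce rules, not covered here; DT is the handiest, so we jump to slide 83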
S80:
S81:
S82:
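S83: Today introduces these 3 commonly used evaluation methods; there are 2 more
S84: This is a supervised method, so the left column shows the actual class and the right columns show the predicted outcome: TP (true positive) or one of the other cases. The example shows the computed numbers
S85: Evaluation Measures (very important formulas); the measures can be computed with these formulas
S86: Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity. Calculation: All is the total number of test tuples; explained HIV and imbalance
When computing accuracy there can be a class imbalance problem. Question: what does HIV in HIV-positive mean?
S87: **my copy is blank** but the professor stressed that precision, recall, and F-measures are very important and are needed for imbalanced data; precision and recall trade off against each other, so researchers devised the method on the next slide
S88: F1 and Fβ: the professor's research always uses F1 for evaluation. F1 is the harmonic mean of precision and recall; Fβ adds a weight to the calculation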
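
The measures from S85 to S88 can be sketched from the four confusion-matrix cells. The counts below are a hypothetical class-imbalanced example of my own (300 positives out of 10,000), chosen to show why accuracy alone is misleading; they are not the numbers from the slides.

```python
def evaluate(tp, fn, fp, tn):
    """Common classifier evaluation measures from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def f_beta(precision, recall, beta):
    """Weighted F-measure: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical imbalanced test set: 300 actual positives, 9700 actual negatives.
metrics = evaluate(tp=90, fn=210, fp=140, tn=9560)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"F2: {f_beta(metrics['precision'], metrics['recall'], 2):.4f}")
```

Here accuracy is 96.5% even though only 30% of the positive cases are found, which is exactly the imbalance situation where precision, recall, and F1 are more informative.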
S89: A worked calculation
S90: Try one exercise
S91: Holdout method: randomly splits the data into training and test sets; it can be repeated k times to get the average accuracy
S92: A more careful holdout method that adds validation
S93: Evaluating Classifier Accuracy; k-fold, where k = 10 is most popular
S94: Diagram of ten runs
S95: Another holdout example; Example of Cross-validation; the blue part is the fold used for validation;
S95:
S96: Evaluating Classifier Accuracy: for small-sized data
S97:
S98:
------------ new ppt ----------------
S99: Evaluating Classifier Accuracy: Bootstrap (63.2% of the tuples end up in the training set); the key difference is sampling with replacement
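
Where the 63.2% comes from, as a short derivation. This is the standard .632 bootstrap argument as I understand it, not copied from the slide; the symbols d (number of tuples), k (number of bootstrap rounds) and M_i (model from round i) are my own notation.

```latex
% Sampling d tuples with replacement from a data set of d tuples:
\[
\Pr(\text{a given tuple is never drawn})
  = \left(1 - \frac{1}{d}\right)^{d} \approx e^{-1} \approx 0.368,
\qquad
1 - 0.368 = 0.632 .
\]
% So about 63.2% of the distinct tuples land in the training set and the
% remaining 36.8% form the test set. Repeating the procedure k times:
\[
Acc(M) = \frac{1}{k} \sum_{i=1}^{k}
  \Bigl( 0.632 \cdot Acc(M_i)_{\text{test set}}
       + 0.368 \cdot Acc(M_i)_{\text{train set}} \Bigr)
\]
```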
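S100: Significance tests: used to compare which model is better
S101: Estimating Confidence Intervals: Classifier Models M1 vs. M2: compare by error rate
S102: Use a t-test, so first state the null hypothesis and then test it (study the statistics on my own)
S103: t-test
S104: Look up the table
S105: Estimating Confidence Intervals: Statistical Significance: decide whether to accept or reject the hypothesis
S106: Model Selection: ROC Curves: another method, comparing TP and FP
S107: Model Selection: ROC Curves: judge accuracy by which side of the red diagonal line the curve lies on
S108: Plotting an ROC curve: compare the true positive rate with precision on the new S86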
S109:
S110: Issues Affecting Model Selection
S111: Techniques to Improve Classification Accuracy: Ensemble Methods (introduction)
S112: Ensemble methods are used to increase accuracy: Bagging, Boosting, Ensemble, introduced one by one below; boosting is bagging plus weights
S113:
S114: boosting is bagging plus weights
S115:
S116: Random Forest (Breiman 2001)
S117: Classification of Class-Imbalanced Data Sets
--------
Go back to s91 in the new ppt and do the exercise
S118: Summary (I)

W15 Text mining

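2023-12-19 (Tue) 09:00-12:00, Prof. 賴錦慧: Text mining
1. Today two Vietnamese classmates presented Chapter 8 first:
2. Guest: Dr. S.J. Lee (李昕潔), director of the Institute of Management of Technology at National Yang Ming Chiao Tung University, speaking on Artificial Intelligence & Smart Healthcare. The new lecture ppt is still missing, to be added.
- For reference, records of last year's (2022) talk: Prof. 李昕潔 on Artificial Intelligence | lecture notes | talk summary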

Dear all,
I will invite a professor to come to our class and give you a speech next week at 10:00AM -12:00PM.
After you have listened to the speech, everyone needs to write a report on their thoughts about the speech.
Please upload your report to the i-learning system before Dec. 26.
In addition, two students will have to make class presentations before the speech.
Please come to class on time.

The following is the speech information.

Title: Artificial Intelligence & Smart Healthcare
Speaker: Dr. Shin-Jye Lee (李昕潔博士)
Professor, Institute of Management of Technology, National Yang Ming Chiao Tung University

Introduction of speaker:

Shin-Jye Lee is currently a professor at the Institute of Management of Technology, National Yang Ming Chiao Tung University, Taiwan. He was a professor at the National Pilot School of Software, Yunnan University, China, and he also pursued his academic career in Poland and Taiwan successively. He received his MSc (Eng) degree from the Department of Computer Science, University of Sheffield, U.K., in 2001, MPhil degree from the Judge Business School, University of Cambridge, U.K. in 2012, and PhD degree from the School of Computer Science, University of Manchester, U.K. in 2011, respectively.
In addition, he had practical experience at Fujitsu and Microsoft from 2002 to 2005. His research interests primarily comprise machine learning, computational intelligence and decision support systems, operational research, and technology policy, especially for climate change issues and energy prediction.

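Notes taken during the speech

S1:
S2: Prof. Lee got his PhD degree and said that in 2011 it was hard for an AI researcher to find a job; 2013: Industry 4.0.
S3: What is AI? Making machines learn like human beings. A baby has no knowledge of this world; it learns. (ChatGPT is one practical application of AI.) Images (e.g., face recognition, where most of the breakthroughs have already been made) and natural language
AI is the application; the technology is based on ML and DL. ML includes many methods: decision trees (DT), Bayes, fuzzy systems, neural networks (NN), etc.
DL is a branch: deep learning with neural networks. It can do ...
S4: Go watch the film A.I.: the robot grows up but does not know it is not human. *AI is an application of machine learning.
S5: Weak AI: everything is learned. We are still at this stage today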
S6:
S7:
S8:
S9: Can robots really think? We can understand the process, but an NN cannot
S10: Alan Turing, father of computer science; AI Turing Test
S11:
S12:
S13:
S14: How to Apply Big Data under the Uncertainty? The real world has too much data (big data); most of it is not clear to us (noisy). Professors from different fields use different tools to analyze these data
S15:
S16: By Machine Learning (the German student Elena asked about RCNs; see How Do RCNs Learn?. Replacing gradient descent with…); series of behaviours = pattern = frequent;
S17:
S18:
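S19: 4 Types of Machine Learning Algorithms (ChatGPT is based on supervised learning with a large language model, plus reinforcement learning; AlphaGo = reinforcement + supervised)
S20: Supervised Learning: we know the result first (like having the solution paper to check against after finishing an exam). It handles classification problems (whose result is a categorical answer: yes or no, male or female, accept or reject) and regression problems (finding the relationship in a set of numbers), whose result is a numeric answer
S21: This is one method for handling a classification problem
S22: For a regression problem, the output is numeric data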
S23:
S24:
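S25: Clustering Problem: compared with S21, there are no labels here; even after grouping, the groups have no label names. An E-Gate is supervised learning, whereas e-commerce often uses clustering: your purchasing behaviour gets grouped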
S26:
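S27: (51:00) Used in smart healthcare: real-world data mostly has no labels. In medicine, the few labeled data available must be used to classify a large amount of data, which calls for semi-supervised learning;
S28: An example of the process on S27
S29: Reinforcement Learning, simulated testing. Doctors are short of time: judging one image may take three or four minutes, but there are thousands or even tens of thousands of images. So after labeling a few, ML can help train a model.
Because we do not know the features, we can only learn by experimenting: try and train. See the dog-training figure below
S30: Reinforcement Learning in ML: dog-training figure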
S31:
S32: Transfer Learning is a research problem in ML that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.
S33: learning from a large amount data/labels
S34:
S35:
S36: ANNS Artificial Neural Networks – How does the Brain work?
S37:
S38: Perceptron: inputs multiplied by weights
S39: The Mechanism of ANNs
S40:
S41: Bayesian Networks: Bayes' theorem and formula, solving
S42:
S43: Using a Naive Bayes Classifier as the classifier
S44:
S45: A Hierarchy of Naive Bayes Classifier in a Yes/No Problem
S46:
S47:
S48:
S49: (SVM) Support Vector Machine
S50:
S51: Linear Separability (線性可分、不可分)
S52:
S53:
S54: Fuzzy Systems: basically if..then systems
S55: Similarities and differences between Fuzzy and Boolean
S56:
S57:
S58:
S59:
S60:
S61:
S62: Simple Structure of Fuzzy Systems
S63: The Mechanism of Fuzzy Inference Systems
S64: Fuzzy System Identification
S65:
S66:
S67: “ Artificial life
(also known as AL, A-Life, or Alife) is an interdisciplinary study of life-like processes using a synthetic methodology”
S68:
S69: Strong AL vs. Weak AL
The strong alife - "life is a process which can be abstracted away from any particular medium"
The weak alife - denies the possibility of generating a "living process" outside of a chemical solution, and tries to understand the underlying mechanics of biological phenomena.
S70: Conway's Game of Life: the universe of the Game of Life is an infinite 2-D orthogonal grid of square cells, each cell with two states, alive or dead, or "populated" or "unpopulated"
S71:
S72: Game Reward A Reward Policy of Game of Life
S73:
S74:
S75: Smart Healthcare Powered by AI
S76: OUTLINE RISK PREDICTION

Real research cases:
1.RISK PREDICTION OF THYROID CANCER ON ULTRASOUND, WAN-FANG (Taipei Wan Fang Hospital)
2.RISK PREDICTION WITH PEDIATRIC ECHOCARDIOGRAPHY, KAOHSIUNG CHANG-GUNG (Kaohsiung Chang Gung)
3.FUNDUS PHOTOGRAPH-BASED DEEP LEARNING ALGORITHMS IN DETECTING DIABETIC RETINOPATHY

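S77: CASE STUDY - RISK PREDICTION OF THYROID CANCER ON ULTRASOUND
S78: THYROID NODULES: the thyroid is located at the base of the neck, below the Adam's apple.
S79: Normal appetite but sudden weight loss, sometimes with a racing heart, difficulty falling asleep, muscle weakness, nervousness and irritability -> time to see a doctor.
S80: Risk Prediction of Thyroid Cancer on Ultrasound: once the ultrasound image is obtained, it can be compared with healthy images; it can also be used for comparison in health checkups. Possible causes of thyroid nodules include: overgrowth of thyroid tissue, thyroid cysts, chronic inflammation of the thyroid, multinodular goiter, and thyroid cancer.
S81: DIAGNOSIS: when evaluating a neck lump or nodule, one of the doctor's main goals is to rule out cancer. The doctor also wants to know whether the thyroid is functioning normally. Tests include: physical exam, thyroid function tests, ultrasound, fine-needle aspiration biopsy, and thyroid scan.
S82: Bethesda System (1/2): The Bethesda System (TBS) was originally introduced in 1988 for reporting cervical or vaginal cytology diagnoses; since 2010 there is also a system for thyroid nodule cytopathology called The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC or BSRTC).
S83: (2/2) Both are outcomes of NIH-sponsored conferences.
S84: TI-RADS™ THYROID IMAGING REPORTING & DATA SYSTEM
S85: Used together with the ultrasound images to assist the doctor's diagnosis.
S86: Pipeline: image - NEURAL NETWORK - Deep Learning - compute the number and size of TI-RADS NODULES
S87: CASE 2 - RISK PREDICTION WITH Pediatric ECHOCARDIOGRAPHY
S88: Diagram of how the heart works
S89: The doctor marks eight points and judges from the distances between them, whereas for thyroid cancer the region is outlined in the image and compared with healthy ones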
S90:
S91: EJECTION FRACTION (1/4): the instance segmentation figure is missing (first do the outline, then the internal measurements)
S92:
S93:
S94:
S95:
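S96: M-mode image of the left ventricle showing the ventricular walls, the dimensions of the LV chamber, and cardiac function measurements. The y-axis is the distance from the transducer (in millimeters); time (in milliseconds) is on the x-axis. The M-mode image shows the LV AW, LV chamber, and LV PW throughout diastole (d) and systole (s). The echo peaks visible along the PW during systole represent the papillary muscle entering the field of view.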
S97:
S98:
S99: Here the blood vessels are directly visible in the image: CASE STUDY - FUNDUS PHOTOGRAPH-BASED DEEP LEARNING ALGORITHMS IN DETECTING DIABETIC RETINOPATHY
S100:
S101:
S102:
S103:
S104:
S105:
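S106: Glaucoma testing instruments are expensive, but a numerical simulation approach can be used for the judgment: deep learning colors the image. This method can also be used for health checkups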
S107: Smart Manufacturing Research Works
S108: Fintech Research Works -
S109:
S110:
S112:
S113:
S114:
S115:
S116:
S117:
S118:
S119:
S120:

Missing slides: (I need the new 2023 ppt file) the 2022 ppt does not have this slide on sleep testing:
5.Analysis of the severity of sleep apnea based on PSG physiological parameters (by EEG)

orthopedics
The ATI vs. Nvidia rivalry
Can Nvidia sell GPUs to China now? Which specifications may be sold?

W16 Recommender systems and applications

2023-12-26 (Tue) 09:00-12:00 Prof. 賴錦慧, recommender systems and applications
1. Today we start Chapter 10: Cluster analysis.
Because time is limited, only the concepts and the most important algorithms are introduced.
S01: The time is spent on just two: the partitioning approach and the hierarchical approach.
S02: What is cluster analysis? Different from classification: there are no labels.
A cluster is a collection of data objects. The distance measures learned earlier can be used to find similarity.
It is a form of unsupervised learning: learning by observation, as opposed to learning by examples.
S03: Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
S04: Clustering for Data Understanding and Applications
S05: Clustering as a Preprocessing Tool (Utility)
S06: Quality: What Is Good Clustering?
S07: Quality: What Is Good Clustering?
S08: Measure the Distance among Clusters
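Different data types use different distance measures, and sometimes weights have to be added.
S09: Measure the Distance among Clusters. Four kinds of distance need to be computed to make the judgment.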
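The notes do not list which four distances the slide showed. As a sketch only, assuming the four common choices (single link, complete link, average, centroid) and Euclidean distance between points, they could be written as below. All function names and the two toy clusters are hypothetical.

```python
from itertools import product
from math import dist

def single_link(a, b):
    """Smallest distance between a point of a and a point of b."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Largest distance between a point of a and a point of b."""
    return max(dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Mean of all pairwise distances between a and b."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid_link(a, b):
    """Distance between the two cluster centroids (coordinate-wise means)."""
    ca = tuple(sum(c) / len(a) for c in zip(*a))
    cb = tuple(sum(c) / len(b) for c in zip(*b))
    return dist(ca, cb)

left = [(0, 0), (0, 2)]    # toy cluster
right = [(3, 0), (3, 2)]   # toy cluster
for f in (single_link, complete_link, average_link, centroid_link):
    print(f.__name__, round(f(left, right), 3))
```

On the same pair of clusters the four measures generally give different values, which is why one of them has to be chosen before clustering.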
S10: Measure the Quality of Clustering
S11:
S12:
S13: Major Clustering Approaches (1/3). Partitioning approach: is the point to minimize the square error?
Hierarchical approach: like a tree structure, built from the bottom up; uses algorithms such as DIANA or BIRCH.
S14: Major Clustering Approaches (2/3)
S15:
S16: Partitioning Methods
S17: The K-Means Clustering Method
Partitioning Algorithms: Basic Concept. The goal is to minimize the distances within each cluster.
Start by assuming a k (how many clusters do you want?) and try it out; this is a trial-and-error process. Finding k is always difficult.
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
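But k-means is the best known.
S18: An Example of K-Means Clustering. Start from k=2 and repeatedly recompute and replace the centers until the error is minimal (sometimes a genetic algorithm can be used to find a k to start from).
S19: Comments on the K-Means Method. The weakness is that it only suits numerical data, so it has to be supplemented with other methods.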
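The k-means loop described in S17-S19 (assign each point to its nearest center, recompute the centers, repeat until nothing changes) can be sketched in plain Python as follows. This is a minimal version of Lloyd's iteration under stated assumptions: Euclidean distance, initial centers supplied by the caller, and made-up toy points. It is not the code from the lecture.

```python
from math import dist

def kmeans(points, centers, max_iter=100):
    """Lloyd's iteration; returns (centers, clusters) for the given start centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, clusters

def sse(centers, clusters):
    """Sum of squared distances of points to their cluster center."""
    return sum(dist(p, c) ** 2 for c, cl in zip(centers, clusters) for p in cl)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # toy data, k = 2
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
print(centers, round(sse(centers, clusters), 3))
```

Trying several values of k and comparing the resulting `sse` is one simple way to carry out the trial-and-error search for k mentioned above.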
Jumped ahead to S83: Variations of the K-Means Method
S20: What Is the Problem of the K-Means Method?
S21: PAM: A Typical K-Medoids Algorithm. Uses the total cost as the measure.
S22: The K-Medoid Clustering Method. Suitable for small datasets, not for large ones.
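A rough sketch of the total-cost idea behind PAM: the cost of a medoid set is the sum of each point's distance to its nearest medoid, and a swap of a medoid with a non-medoid is kept only if it lowers that cost. This is a simplified first-improvement variant written for illustration (Euclidean distance, toy points, hypothetical names), not the exact algorithm from the slides. Re-evaluating the total cost for every candidate swap is also what makes the method slow on large datasets.

```python
from itertools import product
from math import dist

def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, medoids):
    """Keep swapping a medoid for a non-medoid while the total cost drops."""
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(len(medoids)), points):
            if candidate in medoids:
                continue
            # Try replacing medoid i with the candidate point.
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:
                medoids, best, improved = trial, cost, True
    return medoids, best

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # toy data
medoids, cost = pam(points, medoids=[(1, 2), (2, 1)])      # deliberately poor start
print(sorted(medoids), cost)
```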
S23: Hierarchical Methods
S24: Hierarchical Methods. There are two, AGNES and DIANA, which proceed in exactly opposite order. There is no need to assume a value of k here.
S25: AGNES (Agglomerative Nesting)
S26: Dendrogram: Shows How Clusters are Merged
S27:
S28: Dendrogram: Shows How Clusters are Merged. Going bottom-up, a cutting point has to be chosen.
S29: DIANA (Divisive Analysis)
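S30: Distance between Clusters. There are five distances; hierarchical clustering needs one of them to be chosen for the computation.
S31: Examples of Hierarchical Clustering. First build a table showing all point-to-point distances. Look at the coordinates and compute the distances from the formula. Clustering can then be done from this distance matrix (e.g. to use single link, see the definition on the previous slide).
S32: Single Linkage Method. The process of merging into clusters is shown here; in the end it depends on where you set the cutting point.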
S33: Complete Linkage Method
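The merge-until-cutting-point procedure of S31-S33 can be sketched as below: start with every point as its own cluster, repeatedly merge the two closest clusters, and stop once the closest pair is farther apart than the cutting point. Passing `min` gives single linkage and `max` gives complete linkage. The points, the cut value, and the function name are made up for illustration.

```python
from itertools import combinations
from math import dist

def agglomerative(points, cut, linkage=min):
    """Bottom-up merging; linkage=min is single link, linkage=max is complete link."""
    clusters = [[p] for p in points]

    def gap(a, b):
        # Cluster-to-cluster distance under the chosen linkage.
        return linkage(dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        if gap(clusters[i], clusters[j]) > cut:
            break  # the cutting point has been reached
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (0, 3), (5, 5), (5, 6)]  # toy coordinates
print(agglomerative(pts, cut=2.5))               # single link
print(agglomerative(pts, cut=2.5, linkage=max))  # complete link
```

With the same cutting point the two linkages can give different numbers of clusters, because complete linkage measures the farthest pair of points rather than the nearest.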
S34:
S35:
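S36: Extensions to Hierarchical Clustering. k-means is much faster than hierarchical clustering.
There are two other algorithms, BIRCH (1996) and CHAMELEON (1999).
S37: Stopping here for lack of time, but the most important thing is to get K-Means down first.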

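2. Today the two Vietnamese classmates first presented Chapter 9, Evidence and Probabilities. This is a very good case study: it teaches you from the ground up and helps you think about how to apply data mining in business.
Then Kristian presented Chapter 10, Representing and Mining Text.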

3. With a little time left after the presentations, the professor went on to the last chapter (11), Introduction to recommender systems.
S1:
S2: Recommendation is everywhere
S3:
S4:
S5: Recommender Systems, e.g. Tripadvisor
S6: In the Social Web
S7: What is recommendation system?
S8: Why use Recommender Systems? To save your time so that you spend more money.
S9:
S10: Problem domain
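S11: Recommender systems can be viewed as a function; recommendation data and wording have to be prepared for each item.
RS: Given-Find-Finally. This is the basic idea of how an RS operates.
S12: Paradigms of recommender systems. Everything is based on scores and numbers.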
S13:
S14:
S15:
S16:
S17:
S18: Recommender systems: basic techniques. Collaborative is the most important.
S19:
S20: Collaborative Filtering (CF)
S21: User-based nearest-neighbor collaborative filtering (1/2). This is the rating matrix.
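To make the rating-matrix idea concrete, here is a small sketch of user-based nearest-neighbor CF. It assumes Pearson correlation over co-rated items as the similarity and a mean-centered weighted average of the k most similar users as the prediction; the notes do not say which formulas the slides used. The rating matrix is toy data and the function names are hypothetical.

```python
from math import sqrt

ratings = {  # toy rating matrix: user -> {item: rating on a 1-5 scale}
    "Alice": {"Item1": 5, "Item2": 3, "Item3": 4, "Item4": 4},
    "User1": {"Item1": 3, "Item2": 1, "Item3": 2, "Item4": 3, "Item5": 3},
    "User2": {"Item1": 4, "Item2": 3, "Item3": 4, "Item4": 3, "Item5": 5},
    "User3": {"Item1": 3, "Item2": 3, "Item3": 1, "Item4": 5, "Item5": 4},
    "User4": {"Item1": 1, "Item2": 5, "Item3": 5, "Item4": 2, "Item5": 1},
}

def pearson(ratings, a, b):
    """Pearson correlation of users a and b over the items both have rated."""
    common = sorted(set(ratings[a]) & set(ratings[b]))
    if len(common) < 2:
        return 0.0
    ma = sum(ratings[a][i] for i in common) / len(common)
    mb = sum(ratings[b][i] for i in common) / len(common)
    num = sum((ratings[a][i] - ma) * (ratings[b][i] - mb) for i in common)
    den = (sqrt(sum((ratings[a][i] - ma) ** 2 for i in common))
           * sqrt(sum((ratings[b][i] - mb) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(ratings, user, item, k=2):
    """Predict user's rating of item from the k most similar users who rated it."""
    def mean(u):
        return sum(ratings[u].values()) / len(ratings[u])
    neighbours = sorted(
        ((pearson(ratings, user, other), other)
         for other in ratings if other != user and item in ratings[other]),
        reverse=True)[:k]
    neighbours = [(s, o) for s, o in neighbours if s > 0]  # drop dissimilar users
    norm = sum(s for s, _ in neighbours)
    if norm == 0:
        return mean(user)  # no usable neighbour: fall back to the user's own mean
    return mean(user) + sum(s * (ratings[o][item] - mean(o)) for s, o in neighbours) / norm

print(round(predict(ratings, "Alice", "Item5"), 2))  # 4.87
```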
S22:
S23:
S24:
S25:
S26:
S27: Application: Netflix. Go watch the video introduction.
S28: Stopped here.

W17 Final project presentation

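2024-01-02 (Tue) 09:00-12:00 Prof. 賴錦慧
1. Today: ⏰ Final Project. 洪哲文 presents on 2024/01/02 for 20 min; see the instructions, and follow the sample format for the report. See also W13-1.(2). | gDoc spreadsheet | Materials in preparation | Finished report |
2. References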
1).關聯規則學習
2).K-means算法与关联规则挖掘Apriori算法
3).[Day 6] 非監督式學習 K-means 分群
4).用Weka對資料集進行關聯規則分析！

▼w17_1 Some papers prepared for the Final Project

Some papers prepared for the Data Science Final Project

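□ 2007 paper: Certification of Condition Monitoring Personnel and how it Relates to NDT. Covers developments in personnel training and certification since 2002, with some insight into the problems that the British Institute of Non-Destructive Testing (BINDT) has overcome.
□ EQUIPMENT VIBRATION CONDITION MONITORING TECHNOLOGY BASED ON SPECTRUM IMAGE DEEP LEARNING MODELS
□ Chung Yuan Christian University: A data mining model to identify inefficient maintenance activities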

🙋Experimental Data Mining Research on Factors Influencing Friction Coefficient of Wet Clutch
🙋Prediction of Lubricant Service Life Using Data Mining to Improve Reliability of Water Injection Pumps in Crude Oil Production Facility
🙋Effects of Data Preprocessing on the Prediction Accuracy of Lubricant Service Life of Water Injection Pump in Enhanced Crude Oil Recovery Facility
🙋Application of data mining for spare parts information in maintenance schedule: a case study
🙋ISO Standards for Condition Monitoring
Deformation and friction of MoS2 particles in liquid suspensions used to lubricate sliding contact

🟢IEEE 2016 🙋 i-RCAM: Intelligent expert system for root cause analysis in maintenance decision making (uses the Apriori method)
🙋Wakiru, J. 2017a Analysis of lubrication oil contamination by fuel dilution with the application of cluster analysis. Key words: Used oil analysis (UOA), Lubricant condition monitoring (LCM), Dilution, Cluster analysis.
🙋Wakiru, J. 2017b A lubricant condition monitoring approach for maintenance decision support - a data exploratory case study
Wakiru, J. 2018 A decision tree-based classification framework for used oil analysis applying random forest feature selection
🟢🙋Wakiru, J. 2019 A review on Lubricant Condition Monitoring information analysis for maintenance decision support. My interest and hoped-for contribution are along the same lines: The current findings add to a growing body of literature on Lubricant Condition Monitoring (LCM) and maintenance decision support.
🙋Wakiru, J. 2021 Journal of Quality in Maintenance Engineering: A data mining approach for lubricant-based fault diagnosis
🟢🙋ACTA POLYTECHNICA. 2021 ASSESSMENT OF THE EFFECTIVENESS OF LUBRICATION OF TI-6AL-4V TITANIUM ALLOY SHEETS USING RADIAL BASIS FUNCTION NEURAL NETWORKS (uses the Apriori method)

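🟢🙋2022 Data mining in predictive maintenance systems: A taxonomy and systematic review. Reviews 132 articles published in high-impact journals between 2015 and 2021 and confirms that predictive maintenance (PdM) is an emerging and very active field; driven by the monitoring advances of the Industry 4.0 paradigm and by progress in predictive models and computing, the number of publications has grown exponentially.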

Corrective maintenance (CM), Preventive Maintenance (PvM), and Predictive Maintenance (PdM). Machine Learning (ML),
Regression and multiclass classification are popular approaches in PdM.
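The review introduces a taxonomy for data mining in predictive maintenance and discusses challenges and future research directions in this field.
🟢🙋IEEE Xplore 2023/04 Gas turbine vibration monitoring based on K-means clustering and apriori correlation analysis. Extracts historical gas turbine data, builds a data mining model based on K-means clustering and discretization, and runs Apriori correlation analysis to find the internal relationships among variables related to the vibration values, with the aim of early fault warning.
Searching Google Scholar for Industry 4.0 plus Lubricant turned up: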
Sensors and tribological systems: applications for industry 4.0
2019 Tribology and Industry: From the Origins to 4.0
2023 Recent Progress of Machine Learning Algorithms for the Oil and Lubricant Industry

Searching on Smart Maintenance, I also found:
2020 Part 1/2 Smart Maintenance: an empirically grounded conceptualization
2020 Part 2/2 Smart Maintenance: a research agenda for industrial maintenance management
2017 Intelligent predictive maintenance for fault diagnosis and prognosis in machine centers: Industry 4.0 scenario

About Slideways
Mobil's video: Exxonmobil : Slideway Full Length
Slideways Hydrostatic, linear bearing, balls, and rollers, vee flat rollers
Slideways Slide-way of machine tool

Adding Data, mining, Apriori to the search:
2017 Intelligent Predictive Maintenance for Fault Diagnosis and Prognosis in Machine Centers — Industry 4.0 Scenario

ISO-18436 standard material: ISO 18436-4:2014
ISO-18436 links: web page compiled by 哲文

Final Project Presentation:

1. 04108682 (駱艾俐) Ellina, German: Lung Cancer detection
2. 11204604 洪哲文: Target: slideways failure pre-detection (finished report) - causality relationship
3. 11196021 (蔡水基) Susanto, Thai: Churn no more! Churn prediction in telecommunication using machine learning.
4. 11196011 (鄧佳琳) R.C.Nan: Classifying Food Product Using Image Processing and K-Means Clustering (Food Industry 4.0)
5. 11296006 (簡蓓蓓), Thai: The Development Project of Hot Springs Tourist Attraction

W18 Final-exam week

2024-01-09 (Tue) 09:00-12:00 Prof. 賴錦慧
Reference: subjects to study
Weka1: Introduction to WEKA
Weka2: WEKA tutorial - classification (applying Weka)

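Four more classmates presented their Final Project proposals:
1. This classmate works at an accounting firm and wants to learn to use fraud detection.
Company financial data can be found here. You can also google "How do I find a company's financial statements?" and many sites that provide them will come up.
2. The Indonesian classmate 楊宜靜 talked about a study of stress in a coffee shop.
The professor asked her: why use both a decision tree and linear regression?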

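3. 11296016 阮芳草 (Hailey) presented (with the surname 阮, probably Vietnamese):
Bin Packing using Reinforcement Learning with Packing Configuration Tree
The professor asked: it looks like three methods were used?
The professor suggested: on the first page of the methodology, draw a clear flowchart from data input to output, which would clarify how the three methods relate.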

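4. An Indonesian classmate talked about work engagement.
The professor asked: how did you collect the data? (By questionnaire, using data from a previous master's thesis.) If your data are numeric, linear regression can be used, but using the correlation coefficient is a bit odd. (The student explained.)
The professor suggested: consider whether there are other variables related to work engagement, e.g. more categories or two or three more attributes. With different categories or variables the correlation coefficient becomes usable; otherwise you only have two variables.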

Backup Data: other reference material

Book | Data Mining | Data Science for Business | Data Science and Big Data Analytics |
URL | Kaggle | yelp-Dataset |
▼1 Self Introduction 洪哲文

Self Introduction

I entered university in 1974, which was already half a century ago. The world was very simple at that time.
There were no personal computers, no cell phones, and many people didn’t even have phones at home. Can you imagine what the students were busy with at that time?
I was born in 1955, the same year as Chung-Yuan University. Fortunately, I started using computers with an Apple II and have kept up with the growth of the PC and the Internet until today.
I've used CP/M, MS-DOS, Windows, FreeBSD, Mac, and now Linux.
Before Windows 95, I wrote the purchasing, sales, and inventory system for my company in FoxBase. With the help of my colleagues, in 2004 all systems were converted to FreeBSD, MySQL, PHP and Apache (so-called cloud systems).
Even so, I feel that there are still many things I don’t understand about the details of company management and the new development of computer software, and I want to clarify things.
Therefore, when I handed over my daily management tasks, I took advantage of this opportunity to continue learning. I would like to ask for your advice in the future. Thank you, teachers and classmates.

▼2 Dr.袁博

Dr.袁博 - Data Science

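Dr.袁博, Introduction to Data Science, Tsinghua University Shenzhen Research Institute
Data types: discrete, continuous, text, symbolic
Data science: exploring the information and patterns hidden in data.
Where does the data come from? What is the physical meaning behind it? How should it be preprocessed? How should the results of analysis be interpreted? What social problems may arise in real applications?
Big data is characterized by large volume, many data types, and a high rate of generation. 3V: Volume, Variety, Velocity
In the big-data era, purchase records, ratings, health information, activity traces, and even social relationships may all need to be fused, and the data to process include structured data, text data, missing data, noise, and so on. All of this poses serious challenges to traditional algorithms.
How big-data analysis differs from traditional data analysis:
The diversity of data types makes it possible to describe things along more dimensions and to run association analysis across data from different domains, producing an effect where 1 plus 1 is far greater than 2 and markedly improving the effectiveness of data analysis.
1.2 Unity of knowledge and action
Data itself is a means of production, no longer just a by-product of other production activities.
From sampling to full-sample analysis: the attributes of every individual in the population.
From model-driven to data-driven. (Training on cats)
Moneyball: evaluating players by modeling with big-data analysis. Decisions are based on technical statistics, not on the coach alone.

1.3 Seeing the big picture from small signs
Data technology applications: clustering/classification, association analysis/recommendation.
PageRank, targeted ad delivery systems, collaborative filtering.
When data becomes ubiquitous and we live every moment in an ocean of data,
processing, analyzing, and using data is bound to become an indispensable, high-value skill.
2.1 Data acquisition: sensing everything - Dr.唐仙
Sources of big data: Business Process, Human, Machine.
Example: fitness wristband, IMU = accelerometer + gyroscope + magnetometer;
3G of data per day, with more than 100 million units in use worldwide.
2.2 Data acquisition: simplifying the complex
Common wireless sensor nodes: microcontroller (MCU), e.g. the MSP430 series, programmable and extensible. Digital signal processor (DSP); common models include the TMS320 series.
Also often encountered: FPGA (field-programmable gate array), which can be reprogrammed repeatedly; ASIC (application-specific integrated circuit), which cannot be modified once designed; and the graphics processing unit (GPU).
2.3 Data acquisition: unlimited prospects
Robustness (魯棒性, also called 健壯性 or 穩健性).
Common wireless communication technologies: cellular communication; WiFi (IEEE 802.11 standard); Bluetooth, a short-range wireless system; the newer BLE (Bluetooth Low Energy), which sends only small blocks of data; ZigBee (IEEE 802.15.4 standard), a cheap, low-complexity, low-power, low-rate wireless link.
Power consumption: WiFi > BT > BLE > ZigBee. Newer options: near-field communication (NFC), LoRa, NB-IoT, Sigfox, etc.
Bing: What are the similarities and differences between licensed spectrum and unlicensed spectrum?
2.4 Data acquisition: planning ahead
Energy management circuits. Improving energy efficiency. The commonly used linear regulator (LDO).

Hardware circuit design: sensor design, signal processing circuits, wireless communication, power management unit.

3.1 Data visualization: a picture is worth a thousand words
Data visualization is the technique of turning abstract data or concepts into visual information suited to human understanding.
Visualization is a technique for conveying information effectively in the form of images, charts, or animation.
Visualizing the word-frequency distribution of a text helps us quickly grasp its important content.

3.2 Data visualization: windows to the mind
RGB: 0,0,0 = black; 255,255,255 = white;
When recognizing an object, people are influenced by what already surrounds it and draw on experience to assist the judgment.
For a computer program, telling apart objects of different colors and telling apart objects of different shapes may not differ much in computational cost.
Common visual encodings include position, length, gradient, area, volume, shape, hue, saturation, contrast, and texture.
3.3 Data visualization: visualization in history
In data analysis, besides looking for general patterns, do not lightly dismiss exceptions and anomalous data.
Example: the Flow Map of Napoleon's campaign drawn by Charles Minard.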

Dr.袁博, Introduction to Data Science, Tsinghua University Shenzhen Research Institute

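1.1 Getting started
Hello everyone, I am glad to meet you again on the MOOC platform. Welcome to a stroll through the world of data science. "Data" is a high-frequency word in modern society, so what is data? Simply put, data is a description and record of things. The thing may be tangible, such as a car, or an abstract process or concept, such as the state of economic development. Generally, people understand the things around them through a set of attributes, such as a person's age, height, weight, sex, ethnicity, and so on.
Data, then, corresponds to the possible values of these attributes. By type, it can be continuous, discrete, symbolic, textual, and so on; symbolic and textual data must first be converted to numbers before a computer can process them. Data can therefore also be seen as an abstract representation of objective things, describing their properties, states, and interrelations, which helps us perceive, understand, and analyze the external world. A closely related concept is information, which people often conflate with data.
Compared with raw data, information is a higher level of abstraction. It rests on data but expresses the data's meaning, and it helps us judge and decide. For example, a student's exam score can be seen as data at the physical level; by comparing it with classmates' scores or with the same student's score last semester, we obtain logical-level information such as relative standing in the class or the trend of the scores. Information itself, of course, is often expressed as data, e.g. "this student improved markedly, moving up five places in the class ranking compared with last semester."
As the name suggests, data science is a discipline that deals with data. As data has become ubiquitous over the past decade, data science has come to run through every academic field and everyday life. Even disciplines traditionally far from computing have begun to seek support from large-scale data analysis. In the 1970s and 1980s, computers were high-end equipment in research institutions, and only a few specially trained people had access to them; today, from wristbands to phones, computing devices of every kind are with us 24 hours a day and have become an indispensable part of daily life.
Data science is also not a single technique but a systematic discipline covering data acquisition, transmission, storage, analysis, and presentation. By analogy: iron ore must first be mined and transported, then roughly processed into pig iron, then smelted into standardized steel, and finally made into parts with specific uses according to each industry's needs. Data science is therefore not an isolated discipline; it is closely tied to statistics, signal acquisition and processing, database systems, high-performance computing, computer networks, and even the social sciences.
When data science is mentioned, the first things that come to mind may be programming and all sorts of seemingly profound algorithms. First, computer programs are the final means of implementing data science work; the two complement each other but are to some extent independent. In other words, doing data science does not necessarily mean writing large amounts of code, nor being tied to one particular development environment; more often it means using various analysis tools, combined with domain knowledge, to explore the information and patterns hidden in data. Second, algorithms are the core of data science, but learning algorithms alone is not enough to turn it into a powerful tool that changes our lives. We also need to understand where the data comes from, the physical meaning behind it, how to preprocess it, how to interpret the results of analysis, and the social problems that may arise in real applications. A gap at any step can cause the analysis to fail or even lead to serious negative effects; every student who intends to work in data science should take this seriously.
An important concept in data science is big data, perhaps one of the hottest technology terms of recent years; the academic research, business models, and industrial applications around it are too many to list. The three typical characteristics of big data are large volume, many data types, and a high rate of generation, corresponding to Volume, Variety, and Velocity, collectively the 3V. Below is a brief look at the different challenges they pose for data processing.
The surge in data volume places unprecedented demands on storage and computing power. The traditional single-machine serial processing model has been replaced by large-scale parallel and distributed computing architectures, and the amount of data to process has jumped from MB to GB and TB; some large Internet companies already handle data at the PB scale.
The diversification of data types poses new challenges for analysis algorithms. Traditional algorithms usually analyze a single data source with very limited data types; in fact most classic data mining algorithms assume numeric data. In the big-data era, however, customers' purchase records, ratings, health information, activity traces, and relationships with others in social networks may all need to be fused to build an accurate user profile. These data include structured data in the traditional sense and large amounts of unstructured text, and may also have significant missing values and noise, all of which seriously challenge traditional algorithms.
The problems brought by the high rate of data generation are concentrated mainly in stream processing. Traditionally we assume the data is complete and needs only one-off processing; when data arrives continuously, we need algorithms that can promptly detect new patterns in the data and adjust existing models dynamically. In my personal view, the greatest feature and advantage of big-data analysis over traditional analysis is the diversity of data types: things can be described along more dimensions, and data from different domains can be analyzed jointly, producing an effect where 1+1 is far greater than 2 and markedly improving the effectiveness of data analysis.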

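1.2 Unity of knowledge and action
The impact of data science on social development and daily life is so great that a new concept, big-data thinking, has emerged, changing the angle from which people look at things and the way they work. Someone gave this example: traditionally, our knowledge of things is like individual photographs, a discrete and incomplete description; when we can produce enough photographs in a short time, what we see is video, which can be regarded as a continuous and complete description. Quantitative change has produced qualitative change, letting us observe and analyze the same things in an unprecedented way. There is no agreed definition of big-data thinking; I personally think at least the following three aspects are worth considering.

First, data is a means of production. Today data is no longer a by-product of other production activities but a means of production with very high potential value; developed and used properly, it can directly create measurable economic benefit. For social networking sites such as Facebook and LinkedIn, customer information is their most important, even their only, source of value; with the information and behavioral descriptions of a massive customer base, all kinds of value-added services can be derived.

Second, from sampling to full-sample analysis. In statistics, one of the most familiar concepts is the sample, and the vast majority of statistical analysis is built on sampling the population, hoping to infer specific properties of the population from the sample. In the big-data era, as generating data becomes relatively cheap and we gain the ability to store and analyze massive data, we care more about the attributes of each individual in the population. In the past we could only analyze at the macro level the composition of a shopping mall's customer base; now we can understand each customer's spending habits in depth and tailor a personalized experience for them.

Third, from model-driven to data-driven. Traditionally, people had to analyze things in depth and model them precisely, reasoning and deciding only after sorting out the causal relationships among the factors, as with Newton's three laws of motion in physics. When data is abundant enough, it can to some extent be assumed that the data in hand already describes the thing of interest fairly completely; even without a rigorous theoretical model, key patterns and information can be found from massive data in a data-driven way. A typical example is today's popular deep learning frameworks: we need not give a precise definition of a cat, nor hand-extract the relevant features from images; with the support of high-performance computing platforms, training on a large number of images containing cats achieves a very high recognition rate.

On the value of data science in real life, I would like to share a classic case from sports. The story has been published as a book and made into a film starring Brad Pitt, titled Moneyball (translated as 點球成金). Just as football is followed in China, baseball is a top sport in the United States, and Major League Baseball attracts huge audiences and commercial capital. It is well known that building a strong team requires deep pockets to buy top star players, whose price often runs to tens of millions of dollars; only a few big clubs, such as the New York Yankees and the Boston Red Sox, can afford them. Yet the Oakland Athletics, a low-budget club in the league's West division, under general manager Billy Beane at the beginning of this century, relied on rigorous data analysis and took a distinctive approach to player selection, achieving a small-beats-big miracle. With about 1/3 of the payroll of the wealthy clubs, they not only set a remarkable major-league record of 20 consecutive wins but also reached the playoffs as first in the West division, becoming a team that could stand alongside the rich clubs.

In evaluating players they did not rely entirely on scouts' subjective impressions or simply on price, and so avoided wasting precious money on high-priced players who were flashy but ineffective or did not fit the team's style. Instead, they analyzed and modeled each candidate's historical data and technical characteristics and, according to the team's actual needs, assembled a group of players who appeared to have various flaws and were therefore undervalued by the transfer market. In every game the team also arranged its lineup according to the players' technical statistics, reducing the coaches' subjectivity and one-sidedness. With very limited funds, the Oakland Athletics used data analysis tools to optimize the allocation of money and personnel and built a team capable of contending for the title, with a far-reaching impact on the development of professional sports.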

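1.3 Seeing the big picture from small signs
The data science closest to your daily life is probably in e-commerce. Recommender systems can be seen on the vast majority of e-commerce platforms, which is why, after you browse or buy certain goods, the platform pushes you a list of items you might be interested in. The items on that list are tailored for you by the recommender system from your purchase record and the buying habits of countless other consumers. In fact, e-commerce platforms are constantly tracking and recording our personal information and consumption behavior, building user profiles from it, and gradually deepening their understanding of customers. On the positive side, data analysis can give users a better personalized experience and help them find what they really need in a vast sea of goods. When choosing goods we also rely heavily on user reviews, but e-commerce platforms are often full of malicious negative reviews or paid posters (水軍); here data analysis can act as a sharp pair of eyes, helping us separate the true from the false.

On the other hand, the user information held by merchants can also be used against us. For example, some merchants selling luxury goods mix high-quality fakes or defective items in with genuine ones to make huge profits, and to reduce the chance of being noticed and reported, they tend to send the fakes to particular customers. A merchant may check a customer's complaint and review history to learn whether the customer is very picky about quality or very knowledgeable. There is also a factor you may overlook: where the customer lives. If there is a brand store nearby, the customer may well take the item there to verify it, and the merchant would easily be exposed.

Although data science has a very wide range of applications and novel problems keep appearing, the core techniques fall mainly into a few categories: clustering, classification, association analysis, and recommendation. Clustering is unsupervised learning and reflects the idea that birds of a feather flock together: raw data are grouped by their own attributes so that samples in the same group are similar while samples in different groups differ more. Classification is supervised learning: a model is trained on a given set of labeled samples, eventually forming suitable decision boundaries in the sample space so that unknown samples can be recognized fairly accurately. If clustering and classification are techniques shared by data science, machine learning, and pattern recognition, association analysis and recommender systems are more distinctive to data science. In association analysis, analyzing large numbers of purchase records reveals which items are often bought together and which other items a customer who bought a given item is likely to buy. Recommender systems cover a broader range: the PageRank algorithm for ranking search engine results, collaborative filtering for product recommendation, and targeted ad delivery systems are all typical examples.

As a teacher in the field of data science, I believe data science is a powerful booster that helps students enter every discipline and industry and realize their ideals. Whether you hope to work in the Internet, artificial intelligence, healthcare, finance, e-commerce, or transportation and logistics, or plan to find your calling in academia, data science may be your best choice at present. When data becomes ubiquitous and we live every moment in an ocean of data, processing, analyzing, and using data is bound to become an indispensable, high-value skill. Data science is now not just a specialized technical field but has risen to the level of national strategy; major measures are coming in digital infrastructure, the integration, development, and sharing of data resources, and data security. As data science converges with society and people's livelihoods, it is having a major impact on economic development, social governance, and state administration, giving strong support to scientific government decision-making, precise social governance, and efficient public services, while laying the foundation for the transformation and upgrading of the real economy and its integration with the digital economy, and becoming an important driver of innovation in a data-driven society.

As an introductory course, our starting point is to broaden your horizons and show you all aspects of knowledge related to data science, especially aspects rarely covered in existing data science MOOCs, such as data visualization, high-performance computing, sensor-based data acquisition, and data ethics. Data science is broad and deep, and the specific work of its practitioners varies enormously; in this course we try to introduce ideas, techniques, and principles that apply generally. Although your backgrounds and learning goals may differ, I believe everyone can take this course as a first introduction to data science, and through careful study, active thinking, and exploration, develop an understanding of and interest in data science and get an initial grasp of the learning methods and technical roadmap required. If you are particularly interested in data mining algorithms, you are welcome to go on to our related MOOC, Data Mining: Theory and Algorithms. Now let us walk through the door of data science together.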
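The association-analysis idea in section 1.3 (which items are often bought together, and what a buyer of one item is likely to buy next) comes down to counting support and confidence over purchase records. A minimal sketch restricted to item pairs follows; the baskets are invented, `pair_rules` is a hypothetical name, and a full Apriori implementation would also extend to larger itemsets.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (x, y, support, confidence) for rules x -> y over frequent item pairs."""
    n = len(baskets)
    item_count = Counter(item for b in baskets for item in set(b))
    pair_count = Counter(pair for b in baskets
                         for pair in combinations(sorted(set(b)), 2))
    rules = []
    for (x, y), c in pair_count.items():
        support = c / n  # share of baskets containing both items
        if support >= min_support:
            # confidence(x -> y) = P(y in basket | x in basket)
            rules.append((x, y, support, c / item_count[x]))
            rules.append((y, x, support, c / item_count[y]))
    return rules

baskets = [  # invented purchase records
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "bread"},
]
for rule in sorted(pair_rules(baskets)):
    print(rule)
```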

Data Mining: the beer-and-diapers myth
 Ask Dan! about DSS
▼3 Introduction to Data Mining - Dr. Qin Lv

Introduction to Data Mining | University of Colorado Boulder
Dr. Qin Lv

