Tag: Linear regression

Generalised Linear Model: How to Use glm() in R to Model Non-Normal Data (Linear, Logistic, and Poisson Regression)

In this article, we'll get to know the generalised linear model (GLM) and see how to fit GLMs in R.

If you're ready, let's get started.

🤔 What Is a GLM?

A GLM is a statistical technique for modelling outcomes that are not normally distributed, such as data whose only possible outcomes are 0 and 1.

A GLM does this by extending the linear model, and it has 3 components:

Family: the distribution of the dependent variable (y)
Linear predictor: the linear equation of the independent variables (x), also called predictors
Link function: the function that connects the mean of the dependent variable to the linear predictor

💻 GLM in R

In R, we can fit a GLM with the glm() function, which needs 3 inputs:

glm(formula, data, family)

formula = the relationship between the independent and dependent variables, written as y ~ x
data = the dataset used in the analysis
family = the distribution of the dependent variable

Notice that glm() has no separate parameter for the link function. That's because the link is set through family: each family comes with a default link, and we can choose another one by passing it to the family function, e.g. binomial(link = "probit").

The data types, families, and default link functions most commonly used with glm() are as follows:

Data	family	Default link function
Normal	`gaussian`	`link = "identity"`
Binomial	`binomial`	`link = "logit"`
Poisson	`poisson`	`link = "log"`
Quasi-poisson	`quasipoisson`	`link = "log"`
Gamma	`Gamma`	`link = "inverse"`
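As a small sketch of how the link travels with the family, the snippet below inspects base R's family objects directly. The probit link at the end is only an illustration of a non-default choice; it is not used by the models in this article.

```r
# Each family object carries its default link function
gaussian()$link   # "identity"
binomial()$link   # "logit"
poisson()$link    # "log"
Gamma()$link      # "inverse"

# The link and its inverse are stored as functions on the family object
fam <- binomial()
fam$linkfun(0.5)  # logit of 0.5, which is 0
fam$linkinv(0)    # inverse logit of 0, which is 0.5

# A non-default link is requested through the family, not through glm()
probit_fam <- binomial(link = "probit")
probit_fam$link   # "probit"
```

Passing `family = probit_fam` to glm() would then fit a probit model instead of a logistic one.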

Let's look at examples of using glm() to fit models and interpret the results.

☕ Example Data: coffee_shop

We'll look at examples of using glm() for 3 types of models:

Linear regression
Logistic regression
Poisson regression

As our example, we'll use a mock dataset called coffee_shop, which contains sales data from a coffee shop and has the following columns:

No.	Column	Description
1	`day`	Day number
2	`temp`	Average temperature of the day
3	`promo`	Whether the day had a promotion (Promo, NoPromo)
4	`weekend`	Whether the day was a weekend (Weekend, Weekday)
5	`sales`	Sales
6	`customers`	Number of customers on the day
7	`sold_out`	Whether the shop sold out (Yes, No)

Before analysing the data, we'll create coffee_shop as follows:

# Generate mock coffee shop dataset (15 days)

## Set seed for reproducibility
set.seed(123)

## Generate
coffee_shop <- data.frame(
  
  ## Generate 15 days
  day = 1:15,
  
  ## Generate daily temperature
  temp = round(rnorm(15,
                     mean = 25,
                     sd = 5),
               1),
  
  ## Generate promotion day
  promo = sample(c(0, 1),
                 15,
                 replace = TRUE),
  
  ## Generate weekend
  weekend = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1),
  
  ## Generate the number of sales
  sales = round(rnorm(15,
                      mean = 300,
                      sd = 50)),
  
  ## Generate the number of daily customers
  customers = rpois(15,
                    lambda = 80),
  
  ## Generate sold-out
  sold_out = sample(c(0, 1),
                    15,
                    replace = TRUE)
)

## Convert binary variables to factors
coffee_shop$promo <- factor(coffee_shop$promo,
                            levels = c(0, 1),
                            labels = c("NoPromo", "Promo"))

coffee_shop$weekend <- factor(coffee_shop$weekend,
                              levels = c(0, 1),
                              labels = c("Weekday", "Weekend"))

coffee_shop$sold_out <- factor(coffee_shop$sold_out,
                               levels = c(0, 1),
                               labels = c("No", "Yes"))

## View the dataset
print(coffee_shop)

Output:

   day temp   promo weekend sales customers sold_out
1    1 22.2 NoPromo Weekday   246        73      Yes
2    2 23.8   Promo Weekday   296        76      Yes
3    3 32.8 NoPromo Weekday   354        83       No
4    4 25.4   Promo Weekday   293        87       No
5    5 25.6   Promo Weekend   242        78      Yes
6    6 33.6 NoPromo Weekend   259        91      Yes
7    7 27.3 NoPromo Weekday   334        71      Yes
8    8 18.7 NoPromo Weekday   284        76       No
9    9 21.6 NoPromo Weekday   234        72      Yes
10  10 22.8   Promo Weekend   270        79       No
11  11 31.1 NoPromo Weekend   294        82      Yes
12  12 26.8   Promo Weekday   344        79       No
13  13 27.0   Promo Weekday   292        79       No
14  14 25.6 NoPromo Weekday   316        92       No
15  15 22.2 NoPromo Weekend   139        77      Yes

Let's see how to model the data.

1️⃣ Linear Regression

Linear regression predicts a numeric outcome, such as sales (sales). We can fit it with glm() like this:

# Create a regression model with glm()
linear_reg <- glm(sales ~ temp + promo + weekend,
                  data = coffee_shop,
                  family = gaussian)

We can inspect the model with summary():

# Get model summary
summary(linear_reg)

Output:

Call:
glm(formula = sales ~ temp + promo + weekend, family = gaussian, 
    data = coffee_shop)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)      96.599     59.669   1.619  0.13375   
temp              7.703      2.283   3.373  0.00621 **
promoPromo       23.014     18.537   1.241  0.24025   
weekendWeekend  -73.444     19.654  -3.737  0.00328 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 1222.243)

    Null deviance: 39702  on 14  degrees of freedom
Residual deviance: 13445  on 11  degrees of freedom
AIC: 154.54

Number of Fisher Scoring iterations: 2

The output shows which predictors are significant, along with their coefficients. Each coefficient gives the expected change in the dependent variable for a one-unit change in that predictor while the other predictors are held constant:

The predictors that significantly predict sales are temp and weekend (marked with **)
promo does not significantly predict sales
The coefficient of temp is 7.70, meaning that when the temperature rises by 1 unit, sales are expected to increase by 7.70 units
The coefficient of weekend is -73.44, meaning that sales on weekends are expected to be 73.44 units lower than on weekdays
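A quick way to convince yourself that glm() with family = gaussian is ordinary linear regression is to fit the same model with lm() and compare. The sketch below uses hypothetical simulated data (the `toy` data frame and its columns are made up for illustration, not taken from coffee_shop).

```r
# Hypothetical simulated data (not the coffee_shop dataset)
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 10 + 2 * toy$x + rnorm(50)

# Fit the same model in two ways
fit_glm <- glm(y ~ x, data = toy, family = gaussian)
fit_lm  <- lm(y ~ x, data = toy)

# The coefficient estimates agree
all.equal(coef(fit_glm), coef(fit_lm))

# Predict the outcome for a new observation
predict(fit_glm, newdata = data.frame(x = 1))
```

Because the two fits agree, the coefficients from glm() can be read exactly like those from lm().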

2️⃣ Logistic Regression

Logistic regression predicts an outcome that has only 2 possible values, such as:

Yes, no
Pass, fail
Match, no match

In coffee_shop, we have sold_out, which we can model with a logistic regression using glm() like this:

# Create a logistic regression model
log_reg <- glm(sold_out ~ temp + promo + weekend,
               data = coffee_shop,
               family = binomial)

Then, view the model with summary():

# Get model summary
summary(log_reg)

Output:

Call:
glm(formula = sold_out ~ temp + promo + weekend, family = binomial, 
    data = coffee_shop)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.30934    3.92442   0.334    0.739
temp           -0.04448    0.15189  -0.293    0.770
promoPromo     -1.73681    1.31881  -1.317    0.188
weekendWeekend  2.15038    1.45643   1.476    0.140

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20.728  on 14  degrees of freedom
Residual deviance: 16.426  on 11  degrees of freedom
AIC: 24.426

Number of Fisher Scoring iterations: 4

The output looks similar to that of linear regression. The difference is that the coefficients are on the log-odds scale, so we exponentiate them with exp() to turn them into odds ratios before interpreting:

# Transform coefficient
exp(coef(log_reg))

Output:

   (Intercept)           temp     promoPromo weekendWeekend 
     3.7037124      0.9564938      0.1760812      8.5881237

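We can interpret the odds ratios as follows (note that none of these predictors is statistically significant in the summary above, so the effects should be read with caution):

Predictor	Odds ratio	Interpretation
`temp`	0.96	When the temperature rises by 1 unit, the odds of selling out are multiplied by 0.96 (about 4% lower)
`promo`	0.18	On promotion days, the odds of selling out are 0.18 times those on non-promotion days (about 82% lower)
`weekend`	8.59	On weekends, the odds of selling out are 8.59 times those on weekdays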
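Odds ratios can be hard to picture, so it often helps to convert them to percentage changes or to predicted probabilities. The sketch below shows both on hypothetical simulated data (the `toy` data frame is an assumption for illustration, not coffee_shop).

```r
# Hypothetical simulated data (not the coffee_shop dataset)
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * toy$x))

fit <- glm(y ~ x, data = toy, family = binomial)

# Odds ratios and the implied percentage change in the odds
odds_ratio <- exp(coef(fit))
pct_change <- (odds_ratio - 1) * 100

# Predicted probabilities: type = "response" applies the inverse logit
new_x <- data.frame(x = c(-1, 0, 1))
prob <- predict(fit, newdata = new_x, type = "response")

# Equivalent manual calculation from the predicted log-odds
log_odds <- predict(fit, newdata = new_x, type = "link")
prob_manual <- plogis(log_odds)
```

Applied to the figures in the table, an odds ratio of 0.18 corresponds to (0.18 - 1) × 100, or about 82% lower odds.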

3️⃣ Poisson Regression

Poisson regression predicts count data, that is, the number of events occurring within a given period, such as:

The number of cars on a road in each hour
The number of messages received in 1 day
The number of items sold in 3 months

In coffee_shop, we have customers, which we can model with a Poisson regression using glm() like this:

# Create a poisson regression model
poisson_reg <- glm(customers ~ temp + promo + weekend,
                   data = coffee_shop,
                   family = poisson)

View the model:

# Get model summary
summary(poisson_reg)

Output:

Call:
glm(formula = customers ~ temp + promo + weekend, family = poisson, 
    data = coffee_shop)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    4.108939   0.192499  21.345   <2e-16 ***
temp           0.010077   0.007299   1.381    0.167    
promoPromo     0.010327   0.059616   0.173    0.862    
weekendWeekend 0.012692   0.062739   0.202    0.840    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 7.0199  on 14  degrees of freedom
Residual deviance: 4.8335  on 11  degrees of freedom
AIC: 106.06

Number of Fisher Scoring iterations: 3

The coefficients of a Poisson regression are on the log scale (the log of the expected count), much as logistic regression coefficients are on the log-odds scale. So we exponentiate them with exp() before interpreting:

# Transform the coefficients
exp(coef(poisson_reg))

Output:

   (Intercept)           temp     promoPromo weekendWeekend 
     60.882073       1.010127       1.010381       1.012773

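The exponentiated coefficients of all 3 predictors are about 1.01. These are rate ratios: a one-unit increase in temp, a promotion day, or a weekend multiplies the expected number of customers by about 1.01, i.e. an increase of roughly 1% (not 1 customer). None of these effects is statistically significant in the summary above.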
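To see the multiplicative reading of a Poisson coefficient in action, the sketch below fits a model on hypothetical simulated counts (the `toy` data frame is made up for illustration) and checks that the ratio of two predicted counts equals the exponentiated coefficient. It also computes a rough overdispersion check, which is one common reason to switch to the quasipoisson family.

```r
# Hypothetical simulated data (not the coffee_shop dataset)
set.seed(1)
toy <- data.frame(x = runif(100, min = 0, max = 2))
toy$count <- rpois(100, lambda = exp(1 + 0.5 * toy$x))

fit <- glm(count ~ x, data = toy, family = poisson)

# Rate ratio: multiplicative change in the expected count per unit of x
rate_ratio <- exp(coef(fit))["x"]

# Expected counts at x = 0 and x = 1 differ by exactly that factor
mu <- predict(fit, newdata = data.frame(x = c(0, 1)), type = "response")
mu[2] / mu[1]

# Rough overdispersion check: values far above 1 suggest quasipoisson
dispersion <- deviance(fit) / df.residual(fit)
```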

💪 Summary

In this article, we got to know the GLM, a statistical technique for modelling outcomes that are not normally distributed, and saw how to use the glm() function in R to fit 3 types of models:

Linear regression
Logistic regression
Poisson regression

😺 GitHub

See all the code from this article on GitHub


✅ R Book for Psychologists: A Book on R for Psychologists

📕 Please allow me to plug the first book I've ever written 😆

🙋 If you study or work in psychology and are tired of relying on expensive software such as SPSS and Excel to work with data

💪 I'd like to recommend R Book for Psychologists, a book that teaches R for psychological data analysis, written for psychologists with no prior coding experience

In the book, we lay the foundations of R and walk through commonly used statistical analyses, such as:

Correlation
t-tests
ANOVA
Reliability
Factor analysis

🚀 After reading and following the examples in R Book for Psychologists, you will no longer need to depend on SPSS and Excel, and you'll be able to analyse data on your own with confidence

And you'll be surprised at how easy R is 🙂‍↕️

👉 If you're interested, see the book's details on meb:

View details of R Book for Psychologists

2025-06-26

How to Build a Linear Regression with lm() in R: Predicting Diamond Prices with the diamonds Dataset

Linear regression predicts data with the equation of a straight line:

y = a + bx

y = the dependent variable, i.e. the value we want to predict
a = the intercept (the value of y when x = 0)
b = the slope
x = the independent variable
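As a minimal sketch of what a and b mean, the snippet below fits lm() to hypothetical points that lie exactly on the line y = 2 + 3x, so the fitted intercept and slope recover 2 and 3.

```r
# Hypothetical data that lie exactly on the line y = 2 + 3x
x <- 1:5
y <- 2 + 3 * x

fit <- lm(y ~ x)
coef(fit)  # intercept a and slope b

# The prediction for x = 10 is a + b * 10
a <- unname(coef(fit)[1])
b <- unname(coef(fit)[2])
a + b * 10
```

Real data never sit perfectly on a line, so in practice lm() picks the a and b that minimise the squared differences between the line and the observed y values.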

Because it is easy to use and to understand, linear regression is widely used for prediction in many contexts, for example:

Predict	From
Profit	Advertising spend
Athletic performance	Training hours
Blood pressure	Drug dosage + age
Agricultural yield	Water + fertiliser

In this article, we'll see how to use linear regression in R.

If you're ready, let's get started.

💎 Example Dataset: diamonds

In this article, we'll use the diamonds dataset as our example for linear regression.

diamonds is a built-in dataset from the ggplot2 package. It contains data on more than 50,000 diamonds and has 10 columns:

No.	Column	Description
1	`price`	Price (US dollars)
2	`carat`	Weight
3	`cut`	Quality of the cut
4	`color`	Colour
5	`clarity`	Clarity of the diamond
6	`x`	Length (mm)
7	`y`	Width (mm)
8	`z`	Depth (mm)
9	`depth`	Total depth percentage
10	`table`	Width of the top of the diamond relative to its widest point

Our goal is to predict the diamond price (price).

⬇️ Load diamonds

We can load the diamonds dataset as follows:

Step 1. Install and load ggplot2:

# Install
install.packages("ggplot2")

# Load
library(ggplot2)

Step 2. Load the diamonds dataset:

# Load dataset
data(diamonds)

Step 3. Preview the first 10 rows of the dataset:

# Preview the dataset
head(diamonds, 10)

Output:

# A tibble: 10 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39

🍳 Prepare the Dataset

Before predicting diamond prices with linear regression, we'll prepare the diamonds dataset in 3 steps:

One-hot encoding
Log transformation
Split data

🪆 Step 1. One-Hot Encoding

When some predictors are categorical, we need to convert them to numeric first, which we can do with one-hot encoding, as in this example:

Before one-hot encoding:

Data	Cut
1	Ideal
2	Good
3	Fair

After one-hot encoding:

Data	Cut_Ideal	Cut_Good	Cut_Fair
1	1	0	0
2	0	1	0
3	0	0	1

In R, we can do one-hot encoding with model.matrix() like this:

# Set option for one-hot encoding
options(contrasts = c("contr.treatment",
                      "contr.treatment"))

# One-hot encode
cat_dum <- model.matrix(~ cut + color + clarity - 1,
                        data = diamonds)

Next, we combine the result with the dependent variable and the numeric predictors:

# Combine one-hot-encoded categorical and numeric variables
dm <- cbind(diamonds[, c("carat",
                         "depth",
                         "table",
                         "x",
                         "y",
                         "z")],
            cat_dum,
            price = diamonds$price)

We can check the result of the one-hot encoding with str():

# Check the results
str(dm)

Output:

'data.frame':	53940 obs. of  25 variables:
 $ carat       : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ depth       : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table       : num  55 61 65 58 58 57 57 55 61 61 ...
 $ x           : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y           : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z           : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
 $ cutFair     : num  0 0 0 0 0 0 0 0 1 0 ...
 $ cutGood     : num  0 0 1 0 1 0 0 0 0 0 ...
 $ cutVery Good: num  0 0 0 0 0 1 1 1 0 1 ...
 $ cutPremium  : num  0 1 0 1 0 0 0 0 0 0 ...
 $ cutIdeal    : num  1 0 0 0 0 0 0 0 0 0 ...
 $ colorE      : num  1 1 1 0 0 0 0 0 1 0 ...
 $ colorF      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ colorG      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ colorH      : num  0 0 0 0 0 0 0 1 0 1 ...
 $ colorI      : num  0 0 0 1 0 0 1 0 0 0 ...
 $ colorJ      : num  0 0 0 0 1 1 0 0 0 0 ...
 $ claritySI2  : num  1 0 0 0 1 0 0 0 0 0 ...
 $ claritySI1  : num  0 1 0 0 0 0 0 1 0 0 ...
 $ clarityVS2  : num  0 0 0 1 0 0 0 0 1 0 ...
 $ clarityVS1  : num  0 0 1 0 0 0 0 0 0 1 ...
 $ clarityVVS2 : num  0 0 0 0 0 1 0 0 0 0 ...
 $ clarityVVS1 : num  0 0 0 0 0 0 1 0 0 0 ...
 $ clarityIF   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ price       : int  326 326 327 334 335 336 336 337 337 338 ...

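Now all the categorical predictors have been converted to numeric. Note that only cut has a column for every level; for color and clarity, model.matrix() drops the first level (D and I1), which serves as the reference category.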
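The way `- 1` treats the first factor differently from the rest is easier to see on a tiny example. The sketch below uses a hypothetical 3-row data frame (`toy` is made up for illustration) and also shows one possible way, via the contrasts.arg argument, to keep a column for every level of every factor.

```r
# Hypothetical toy data with two categorical columns
toy <- data.frame(
  cut   = factor(c("Ideal", "Good", "Fair")),
  color = factor(c("E", "F", "E"))
)

# With `- 1`, only the first factor keeps one column per level
mm_default <- model.matrix(~ cut + color - 1, data = toy)
colnames(mm_default)

# Passing full (identity) contrasts keeps every level of every factor
mm_full <- model.matrix(
  ~ cut + color - 1,
  data = toy,
  contrasts.arg = list(
    cut   = contrasts(toy$cut, contrasts = FALSE),
    color = contrasts(toy$color, contrasts = FALSE)
  )
)
colnames(mm_full)
```

Whether to keep every level depends on the model: a regression with an intercept only needs one fewer column than the number of levels per factor, so dropping a reference level is usually harmless there.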

📈 Step 2. Log Transformation

Linear regression assumes roughly normal residuals, so when the dependent variable is strongly skewed, the model often does not perform at its best.

We can check the distribution of the dependent variable with ggplot():

# Check the distribution of `price`
ggplot(dm,
       aes(x = price)) +
  
  ## Instantiate a histogram
  geom_histogram(binwidth = 100,
                 fill = "skyblue3") +
  
  ## Add text elements
  labs(title = "Distribution of Price",
       x = "Price",
       y = "Count") +
  
  ## Set theme to minimal
  theme_minimal()

Output:

From the plot, we can see that the dependent variable is right-skewed.

So, before using linear regression, we'll transform the dependent variable so that its distribution is closer to normal. We can do this with a log transformation:

# Log-transform `price`
dm$price_log <- log(dm$price)

# Drop `price`
dm$price <- NULL

After the log transformation, we can check the distribution with ggplot() again:

# Check the distribution of logged `price`
ggplot(dm,
       aes(x = price_log)) +
  
  ## Instantiate a histogram
  geom_histogram(fill = "skyblue3") +
  
  ## Add text elements
  labs(title = "Distribution of Price After Log Transformation",
       x = "Price (Logged)",
       y = "Count") +
  
  ## Set theme to minimal
  theme_minimal()

Output:

We can see that the distribution of the dependent variable is now closer to a normal distribution.
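The effect of a log transformation can also be checked numerically. The sketch below uses hypothetical right-skewed values drawn with rlnorm() (not the diamonds prices) and compares the gap between the mean and the median before and after taking logs; a large positive gap is a simple sign of right skew.

```r
# Hypothetical right-skewed data (not the diamonds prices)
set.seed(1)
price_toy <- rlnorm(1000, meanlog = 7, sdlog = 1)

# Right-skewed data: the mean sits well above the median
gap_raw <- mean(price_toy) - median(price_toy)

# After the log transformation, the mean and median are close together
log_price <- log(price_toy)
gap_log <- mean(log_price) - median(log_price)

# exp() undoes log(), so values can be mapped back to the original scale
all.equal(exp(log_price), price_toy)
```

The last line is what makes the approach practical: a model fitted on the log scale can still report predictions in the original units.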

🚄 Step 3. Split the Data

In the last step before using linear regression, we'll split the data into 2 sets:

Training set for building the linear regression model
Test set for evaluating the linear regression model's performance

In this article, we'll use 80% of the dataset as the training set and 20% as the test set:

# Split the data

## Set seed for reproducibility
set.seed(181)

## Training index
train_index <- sample(nrow(dm),
                      0.8 * nrow(dm))

## Create training set
train_set <- dm[train_index, ]

## Create test set
test_set <- dm[-train_index, ]
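To check that this kind of split behaves as intended, here is the same pattern on a hypothetical 50-row data frame (`toy`, with a made-up `id` column), followed by a check that the two sets have the expected sizes and share no rows.

```r
# Hypothetical 50-row dataset with an id column for tracking rows
set.seed(181)
toy <- data.frame(id = 1:50, value = rnorm(50))

# Sample 80% of the row numbers for training
n_train <- floor(0.8 * nrow(toy))
train_index <- sample(nrow(toy), n_train)

# Negative indexing keeps every row that was not sampled
train_toy <- toy[train_index, ]
test_toy  <- toy[-train_index, ]

# Sizes of the two sets and the number of shared rows
nrow(train_toy)
nrow(test_toy)
length(intersect(train_toy$id, test_toy$id))
```

Using floor() guards against a non-integer sample size when the number of rows is not a multiple of 5.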

Now, we're ready to build the linear regression model.

🏷️ Linear Regression Modelling

Building a linear regression model takes 3 steps:

Fit the model
Make predictions
Evaluate the model performance

💪 Step 1. Fit the Model

In the first step, we'll build the model with lm(), which needs 2 inputs:

lm(formula, data)

formula = the prediction formula, where we specify the independent and dependent variables
data = the dataset used to build the model

To predict diamond prices, we'll use lm() like this:

# Fit the model
linear_reg <- lm(price_log ~ .,
                 data = train_set)

Code explanation:

price_log ~ . means predict the price (price_log) from all the other variables (.)
data = train_set means we use the training set as the data
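To illustrate what the dot expands to, the sketch below uses R's built-in mtcars dataset (chosen only because it ships with R; it is not part of this article's analysis) and compares the `.` formula with the same model written out in full.

```r
# A small slice of the built-in mtcars dataset
cars_small <- mtcars[, c("mpg", "wt", "hp")]

# `.` stands for every column other than the dependent variable
fit_dot <- lm(mpg ~ ., data = cars_small)

# The same model with the predictors written out
fit_explicit <- lm(mpg ~ wt + hp, data = cars_small)

names(coef(fit_dot))
all.equal(coef(fit_dot), coef(fit_explicit))
```

Because `.` picks up every remaining column, it is worth dropping columns that should not be predictors before fitting, as was done with price after the log transformation.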

We can view the model details with summary():

# View the model
summary(linear_reg)

Output:

Call:
lm(formula = price_log ~ ., data = train_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2093 -0.0930  0.0019  0.0916  9.8935 

Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error  t value Pr(>|t|)    
(Intercept)    -2.7959573  0.0705854  -39.611  < 2e-16 ***
carat          -0.5270039  0.0086582  -60.867  < 2e-16 ***
depth           0.0512357  0.0008077   63.437  < 2e-16 ***
table           0.0090154  0.0005249   17.175  < 2e-16 ***
x               1.1374016  0.0055578  204.651  < 2e-16 ***
y               0.0290584  0.0031345    9.271  < 2e-16 ***
z               0.0340298  0.0054896    6.199 5.73e-10 ***
cutFair        -0.1528658  0.0060005  -25.476  < 2e-16 ***
cutGood        -0.0639105  0.0036547  -17.487  < 2e-16 ***
`cutVery Good` -0.0313800  0.0025724  -12.199  < 2e-16 ***
cutPremium     -0.0451760  0.0026362  -17.137  < 2e-16 ***
cutIdeal               NA         NA       NA       NA    
colorE         -0.0573940  0.0032281  -17.779  < 2e-16 ***
colorF         -0.0892633  0.0032654  -27.336  < 2e-16 ***
colorG         -0.1573861  0.0032031  -49.136  < 2e-16 ***
colorH         -0.2592763  0.0034037  -76.175  < 2e-16 ***
colorI         -0.3864526  0.0038360 -100.742  < 2e-16 ***
colorJ         -0.5258789  0.0047183 -111.455  < 2e-16 ***
claritySI2      0.4431577  0.0079170   55.976  < 2e-16 ***
claritySI1      0.6087513  0.0078819   77.234  < 2e-16 ***
clarityVS2      0.7523161  0.0079211   94.976  < 2e-16 ***
clarityVS1      0.8200656  0.0080463  101.918  < 2e-16 ***
clarityVVS2     0.9381319  0.0082836  113.252  < 2e-16 ***
clarityVVS1     1.0033931  0.0085098  117.910  < 2e-16 ***
clarityIF       1.0898015  0.0092139  118.277  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1825 on 43128 degrees of freedom
Multiple R-squared:  0.9677,	Adjusted R-squared:  0.9676 
F-statistic: 5.611e+04 on 23 and 43128 DF,  p-value: < 2.2e-16

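Note: To learn how to read the output, see Explaining the lm() Summary in R and Understanding Linear Regression Output in R

The NA for cutIdeal arises because the cut columns cover every level and therefore add up to the intercept, so lm() cannot estimate that last coefficient.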
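The "not defined because of singularities" message can be reproduced on a small scale. The sketch below builds a hypothetical 3-level factor (`g`, with simulated `y` values), one-hot encodes every level, and then fits lm() with an intercept, which leaves one coefficient as NA.

```r
# Hypothetical data: a 3-level factor and a simulated outcome
set.seed(1)
g <- factor(rep(c("A", "B", "C"), each = 10))
y <- rnorm(30, mean = as.integer(g))

# One indicator column per level (full one-hot encoding)
dummies <- model.matrix(~ g - 1)
toy <- data.frame(y = y, dummies)

# The indicators sum to 1 in every row, so they are collinear with the intercept
fit <- lm(y ~ ., data = toy)
coef(fit)
```

The remaining coefficients are still valid: each one is the difference from the level whose coefficient is NA, which acts as the reference category.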

🔮 Step 2. Make Predictions

In the second step, we use the model to predict prices with predict():

# Predict the price on the log scale
pred_log <- predict(linear_reg,
                    newdata = test_set)

# Preview predictions
head(pred_log)

Output:

       2        5        9       16       19       22 
5.828071 5.816460 6.111859 5.777434 5.865820 6.088356

The predicted prices are still on the log scale, so we convert them back to actual prices with exp():

# Predict in the outcome space
pred <- exp(pred_log)

# Preview predictions
head(pred)

Output:

       2        5        9       16       19       22 
339.7028 335.7812 451.1766 322.9295 352.7713 440.6961

We can compare the actual prices with the predicted prices, along with the errors, like this:

# Compare predictions to actual
results <- data.frame(actual = round(exp(test_set$price_log), 2),
                      predicted = round(pred, 2),
                      diff = round(exp(test_set$price_log) - pred, 2))

# Print results
head(results)

Output:

   actual predicted    diff
2     326    339.70  -13.70
5     335    335.78   -0.78
9     337    451.18 -114.18
16    345    322.93   22.07
19    351    352.77   -1.77
22    352    440.70  -88.70

🎯 Step 3. Evaluate the Model Performance

In the final step, we'll evaluate the model with 2 metrics:

Mean absolute error (MAE): the average of the absolute errors
Root mean squared error (RMSE): the square root of the average of the squared errors

Both measure the difference between the predictions and the actual data. The higher the MAE and RMSE, the larger the prediction errors and the worse the model performs.

Conversely, if the MAE and RMSE are low, the predictions are close to the actual data and the model is accurate.

(Note: Learn about the differences between MAE and RMSE in Loss Functions in Machine Learning Explained)

We can calculate the MAE and RMSE as follows:

# Calculate MAE
mae <- mean(abs(results$diff))

# Calculate RMSE
rmse <- sqrt(mean((results$diff)^2))

# Print the results
cat("MAE:", round(mae, 2), "\n")
cat("RMSE:", round(rmse, 2))

Output:

MAE: 491.71
RMSE: 1123.68

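From the output, we can see that, on average, the model's price predictions are off by about 492 dollars (MAE). The RMSE (about 1,124 dollars) is much higher than the MAE, which suggests that a few predictions miss by a large amount.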
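To make the two metrics concrete, here is a worked example on three hypothetical price pairs (the numbers are made up for illustration). The single large error of 30 pulls the RMSE above the MAE, because squaring gives big errors more weight.

```r
# Hypothetical actual and predicted values
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)

# Errors: -10, 10, -30
errors <- actual - predicted

# MAE: (10 + 10 + 30) / 3
mae <- mean(abs(errors))

# RMSE: sqrt((100 + 100 + 900) / 3)
rmse <- sqrt(mean(errors^2))

cat("MAE:", round(mae, 2), "\n")
cat("RMSE:", round(rmse, 2), "\n")
```

The RMSE is never smaller than the MAE, and the gap between them grows as the errors become more uneven.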

😎 Summary

In this article, we looked at how to do linear regression in R.

We saw how to prepare data for linear regression:

One-hot encoding with model.matrix()
Log transformation with log()
Split data with sample()

We then built a linear regression model with lm(), made predictions with predict(), and evaluated the model by calculating the MAE and RMSE.

😺 GitHub

See all the code from this article on GitHub


2025-05-29
