Tag: pandas

Load data from a database in 4 steps with sqlalchemy and pandas in Python: an example with the Chinook database
In this article, we walk through 4 steps for loading data from a database with the sqlalchemy and pandas libraries in Python, using the Chinook database as an example:
1. Import libraries
2. Connect to the database
3. List the tables
4. Get the table
If you're ready, let's get started.
⬇️ 1. Import Libraries

In the first step, we import sqlalchemy and pandas:
```
# Import packages
from sqlalchemy import create_engine, inspect
import pandas as pd
```
Note: If the libraries are not installed yet, install them with !pip install before running the imports.

🛜 2. Connect to the Database

In step 2, we connect to the database.

In this example, we connect to a local SQLite database, which we can do with create_engine() like this:
```
# Connect to the database
engine = create_engine("sqlite:///chinook.sqlite")
```
Note: chinook.sqlite can be downloaded from GitHub.

📋 3. List the Tables

In step 3, we list the tables in the database so that we can pick the ones we need.

We use 2 commands:
- inspect(): a function that creates an inspector object for reading the database's metadata
- .get_table_names(): an inspector method that returns the names of the tables in the database
```
# Get the inspector
inspector = inspect(engine)

# List the table names
tables = inspector.get_table_names()

# Print the table names
print(tables)
```
Result:
```
['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
```
🪑 4. Get the Table

In the last step, we load data from the table we want using pd.read_sql():
```
# Set the query
brazil_customers_query = """
SELECT FirstName, LastName, Phone, Email
FROM Customer
WHERE Country = 'Brazil';
"""

# Query the database
df = pd.read_sql(brazil_customers_query, engine)

# Display the df
print(df)
```
Result:
```
   FirstName   LastName               Phone                          Email
0       Luís  Gonçalves  +55 (12) 3923-5555           luisg@embraer.com.br
1    Eduardo    Martins  +55 (11) 3033-5446       eduardo@woodstock.com.br
2  Alexandre      Rocha  +55 (11) 3055-3278               alero@uol.com.br
3    Roberto    Almeida  +55 (21) 2271-7000  roberto.almeida@riotur.gov.br
4   Fernanda      Ramos  +55 (61) 3363-5547       fernadaramos4@uol.com.br
```
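The query in step 4 hard-codes the country. Below is a minimal sketch of passing it as a parameter instead. It assumes a small in-memory table that only mimics a few Customer columns (the rows are made up, not real Chinook data), and it uses Python's built-in sqlite3 connection, which pd.read_sql() also accepts:

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory stand-in for the Customer table (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (FirstName TEXT, LastName TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [("Ana", "Silva", "Brazil"), ("John", "Smith", "Canada"), ("Rui", "Costa", "Brazil")],
)
conn.commit()

# Use a "?" placeholder and pass the value through params
query = "SELECT FirstName, LastName FROM Customer WHERE Country = ?;"
brazil_df = pd.read_sql(query, conn, params=("Brazil",))

print(brazil_df)
conn.close()
```

The "?" placeholder is the sqlite3 style. With an SQLAlchemy engine, the placeholder style depends on the driver, so it is worth checking the driver's documentation first.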
😺 GitHub

See the full example code on GitHub.

2026-02-19

How to use 9 arguments of read_csv() from the pandas library to load data in Python: an example with football match data

pandas is a Python library for working with tabular data, and it provides a range of functions for loading data into Python.

One of the most commonly used is read_csv(), which loads CSV (Comma-Separated Values) data. Its 9 main arguments are:

filepath_or_buffer: the file path, file name, or URL of the file to load
sep: sets the delimiter
header: sets the row to use as the header
skiprows: sets the rows to skip
nrows: sets the number of rows to load
usecols: sets the columns to load
index_col: sets the column to use as the index
names: sets the column names
dtype: sets the data types of the columns

In this article, we look at how to use all 9 arguments of read_csv() to load sample data on football matches in England.

If you're ready, let's get started.

🏁 Getting Started

Before using read_csv(), we need to install and import pandas:

# Install pandas
!pip install pandas

# Import pandas
import pandas as pd

Note: If pandas is already installed, only the import is needed.

🗃️ Argument #1. filepath_or_buffer

filepath_or_buffer is the main argument, and we have to set it every time we call read_csv().

For example, suppose we have football match data (matches_clean.csv):

MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29

We can use read_csv() like this:

# Load the dataset
df1 = pd.read_csv("matches_clean.csv")

# View the result
print(df1)

Result:

  MatchID           HomeTeam     AwayTeam  HomeGoals  AwayGoals   MatchDate
0    M001  Manchester United      Chelsea          2          1  2024-08-14
1    M002          Liverpool      Arsenal          1          1  2024-08-20
2    M003          Tottenham      Everton          3          0  2024-09-02
3    M004           Man City  Aston Villa          4          2  2024-09-15
4    M005          Newcastle     West Ham          0          0  2024-09-22
5    M006           Brighton        Leeds          2          3  2024-09-29
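The "buffer" part of filepath_or_buffer means that read_csv() can also read from a file-like object rather than a file on disk. A minimal sketch, assuming the first two matches of matches_clean.csv are held in a string (the variable names are only for illustration):

```python
import io

import pandas as pd

# A CSV held in memory instead of on disk (first rows of matches_clean.csv)
csv_text = """MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
"""

# StringIO wraps the string in a file-like object that read_csv() can read
buffer_df = pd.read_csv(io.StringIO(csv_text))

print(buffer_df.shape)
print(buffer_df.columns.tolist())
```

This is handy for trying out arguments quickly without creating files first.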

🤺 Argument #2. sep

sep sets the delimiter, the character that separates columns. The default for sep is ",", so we normally don't need to set it when the file is a CSV.

We use sep when the data uses a different delimiter, such as ";" (matches_semicolon.txt):

MatchID;HomeTeam;AwayTeam;HomeGoals;AwayGoals;MatchDate
M001;Manchester United;Chelsea;2;1;2024-08-14
M002;Liverpool;Arsenal;1;1;2024-08-20
M003;Tottenham;Everton;3;0;2024-09-02
M004;Man City;Aston Villa;4;2;2024-09-15
M005;Newcastle;West Ham;0;0;2024-09-22
M006;Brighton;Leeds;2;3;2024-09-29

We can use sep like this:

# Load the dataset with ";" as the delimiter
df2 = pd.read_csv("matches_semicolon.txt", sep=";")

# View the result
print(df2)

Result:

  MatchID           HomeTeam     AwayTeam  HomeGoals  AwayGoals   MatchDate
0    M001  Manchester United      Chelsea          2          1  2024-08-14
1    M002          Liverpool      Arsenal          1          1  2024-08-20
2    M003          Tottenham      Everton          3          0  2024-09-02
3    M004           Man City  Aston Villa          4          2  2024-09-15
4    M005          Newcastle     West Ham          0          0  2024-09-22
5    M006           Brighton        Leeds          2          3  2024-09-29
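To see why sep matters, it helps to compare loading the same semicolon-delimited text with and without it. A minimal sketch, assuming an in-memory copy of the first two matches:

```python
import io

import pandas as pd

semicolon_text = """MatchID;HomeTeam;AwayTeam;HomeGoals;AwayGoals;MatchDate
M001;Manchester United;Chelsea;2;1;2024-08-14
M002;Liverpool;Arsenal;1;1;2024-08-20
"""

# The default sep="," finds no commas, so each line becomes one wide column
wrong_df = pd.read_csv(io.StringIO(semicolon_text))

# sep=";" splits each line into the 6 expected columns
right_df = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(wrong_df.shape, right_df.shape)
```

No error is raised in the first case, so a data frame with a single column is usually the sign of a wrong delimiter.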

😶‍🌫️ Argument #3. header

header sets the row to use as the header.

We use header when the first rows of the file contain something else, such as metadata (matches_with_metadata.txt):

# UK Football Matches Data
# Created for practice with pd.read_csv()
MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29

We can use header like this:

# Load the dataset where the header is the 3rd row (header is zero-based)
df3 = pd.read_csv("matches_with_metadata.txt", header=2)

# View the result
print(df3)

Result:

  MatchID           HomeTeam     AwayTeam  HomeGoals  AwayGoals   MatchDate
0    M001  Manchester United      Chelsea          2          1  2024-08-14
1    M002          Liverpool      Arsenal          1          1  2024-08-20
2    M003          Tottenham      Everton          3          0  2024-09-02
3    M004           Man City  Aston Villa          4          2  2024-09-15
4    M005          Newcastle     West Ham          0          0  2024-09-22
5    M006           Brighton        Leeds          2          3  2024-09-29

Notice that the metadata lines are not loaded.

Note: We can set header=None when the data has no header row, as in matches_no_header.csv:

M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29
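A minimal sketch of header=None, assuming an in-memory copy of the first two lines of matches_no_header.csv:

```python
import io

import pandas as pd

no_header_text = """M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
"""

# Without header=None, the first match would be consumed as column names
no_header_df = pd.read_csv(io.StringIO(no_header_text), header=None)

# pandas labels the columns 0, 1, 2, ... when there is no header row
print(no_header_df.columns.tolist())
print(len(no_header_df))
```

Both matches are kept as data, and the columns get integer labels until we name them.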

🛑 Argument #4. skiprows

skiprows sets the rows that we don't want to load into Python. We can set it in 2 ways:

As an int (such as 2) to skip that many lines from the top of the file
As a list (such as [0, 1, 2]) to skip the lines at those zero-based positions

For example, suppose we want to skip the first 2 lines, which are metadata:

# UK Football Matches Data
# Created for practice with pd.read_csv()
MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29

We can use skiprows like this:

# Load the dataset, skipping the metadata
df4 = pd.read_csv("matches_with_metadata.txt", skiprows=[0, 1])

# View the result
print(df4)

Result:

  MatchID           HomeTeam     AwayTeam  HomeGoals  AwayGoals   MatchDate
0    M001  Manchester United      Chelsea          2          1  2024-08-14
1    M002          Liverpool      Arsenal          1          1  2024-08-20
2    M003          Tottenham      Everton          3          0  2024-09-02
3    M004           Man City  Aston Villa          4          2  2024-09-15
4    M005          Newcastle     West Ham          0          0  2024-09-22
5    M006           Brighton        Leeds          2          3  2024-09-29
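The int and list forms of skiprows can be compared side by side. A minimal sketch, assuming an in-memory copy of the top of matches_with_metadata.txt:

```python
import io

import pandas as pd

metadata_text = """# UK Football Matches Data
# Created for practice with pd.read_csv()
MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
"""

# An int skips that many lines from the top of the file
skip_int_df = pd.read_csv(io.StringIO(metadata_text), skiprows=2)

# A list skips the lines at those zero-based positions
skip_list_df = pd.read_csv(io.StringIO(metadata_text), skiprows=[0, 1])

# A list can also skip lines that are not at the top, here the first match (line 3)
skip_mixed_df = pd.read_csv(io.StringIO(metadata_text), skiprows=[0, 1, 3])

print(skip_int_df.equals(skip_list_df))
print(skip_mixed_df["MatchID"].tolist())
```

For metadata at the top of a file the two forms give the same data frame. The list form is the one to reach for when the unwanted lines are scattered.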

📋 Argument #5. nrows

nrows sets the number of rows to load into Python, counted from the first data row.

For example, instead of loading all of the data:

MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29

we load only the first 3 rows with nrows like this:

# Load the first 3 rows
df5 = pd.read_csv("matches_clean.csv", nrows=3)

# View the result
print(df5)

Result:

  MatchID           HomeTeam AwayTeam  HomeGoals  AwayGoals   MatchDate
0    M001  Manchester United  Chelsea          2          1  2024-08-14
1    M002          Liverpool  Arsenal          1          1  2024-08-20
2    M003          Tottenham  Everton          3          0  2024-09-02

☑️ Argument #6. usecols

usecols sets the columns to load into Python.

For example, to load only HomeTeam and HomeGoals from:

MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29

We can use usecols like this:

# Load only HomeTeam and HomeGoals
df6 = pd.read_csv("matches_clean.csv", usecols=["HomeTeam", "HomeGoals"])

# View the result
print(df6)

Result:

            HomeTeam  HomeGoals
0  Manchester United          2
1          Liverpool          1
2          Tottenham          3
3           Man City          4
4          Newcastle          0
5           Brighton          2
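Besides column names, usecols also accepts column positions or a function that is called with each column name. A minimal sketch, assuming an in-memory copy of the first two matches of matches_clean.csv:

```python
import io

import pandas as pd

csv_text = """MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
"""

# Positions 1 and 3 correspond to HomeTeam and HomeGoals
by_position_df = pd.read_csv(io.StringIO(csv_text), usecols=[1, 3])

# A function keeps every column for which it returns True
goals_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda name: name.endswith("Goals"),
)

print(by_position_df.columns.tolist())
print(goals_df.columns.tolist())
```

The function form is useful when the columns we want share a naming pattern.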

🔢 Argument #7. index_col

index_col sets the column to use as the index of the data, such as MatchID:

MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29

We use index_col like this:

# Load the dataset with MatchID as index col
df7 = pd.read_csv("matches_clean.csv", index_col="MatchID")

# View the result
print(df7)

Result:

                  HomeTeam     AwayTeam  HomeGoals  AwayGoals   MatchDate
MatchID
M001     Manchester United      Chelsea          2          1  2024-08-14
M002             Liverpool      Arsenal          1          1  2024-08-20
M003             Tottenham      Everton          3          0  2024-09-02
M004              Man City  Aston Villa          4          2  2024-09-15
M005             Newcastle     West Ham          0          0  2024-09-22
M006              Brighton        Leeds          2          3  2024-09-29
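One reason to set a meaningful index is that rows can then be looked up by label. A minimal sketch, assuming an in-memory copy of the first two matches of matches_clean.csv:

```python
import io

import pandas as pd

csv_text = """MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
"""

# Use MatchID as the index instead of the default 0, 1, 2, ...
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="MatchID")

# Look up one match by its ID
match = indexed_df.loc["M002"]
print(match["HomeTeam"], match["HomeGoals"])
```

MatchID is no longer a regular column after this, so it does not appear in indexed_df.columns.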

🔠 Argument #8. names

names sets the column names. We use it when:

The data has no header row
We want to rename the columns

For example, to give column names to matches_no_header.csv:

M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29

We can use names like this:

# Set col names
col_names = [
    "id",
    "home",
    "away",
    "home_goals",
    "away_goals",
    "date"
]

# Load the dataset with custom col names
df8 = pd.read_csv("matches_no_header.csv", names=col_names)

# View the result
print(df8)

Result:

     id               home         away  home_goals  away_goals        date
0  M001  Manchester United      Chelsea           2           1  2024-08-14
1  M002          Liverpool      Arsenal           1           1  2024-08-20
2  M003          Tottenham      Everton           3           0  2024-09-02
3  M004           Man City  Aston Villa           4           2  2024-09-15
4  M005          Newcastle     West Ham           0           0  2024-09-22
5  M006           Brighton        Leeds           2           3  2024-09-29
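For the second use of names, renaming the columns of a file that already has a header row, names needs to be combined with header=0 so that the existing header is replaced. A minimal sketch, assuming an in-memory copy of the first two matches of matches_clean.csv:

```python
import io

import pandas as pd

csv_text = """MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
"""

col_names = ["id", "home", "away", "home_goals", "away_goals", "date"]

# header=0 marks the first line as a header to be replaced by names
renamed_df = pd.read_csv(io.StringIO(csv_text), names=col_names, header=0)

# Without header=0, the old header line is kept as a row of data
kept_df = pd.read_csv(io.StringIO(csv_text), names=col_names)

print(len(renamed_df), len(kept_df))
```

The extra row in kept_df is the old header, which also turns the goal columns into text.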

⏹️ Argument #9. dtype

dtype sets the data types of the columns.

For example, to set the data types of MatchID, HomeGoals, and AwayGoals from matches_clean.csv:

MatchID,HomeTeam,AwayTeam,HomeGoals,AwayGoals,MatchDate
M001,Manchester United,Chelsea,2,1,2024-08-14
M002,Liverpool,Arsenal,1,1,2024-08-20
M003,Tottenham,Everton,3,0,2024-09-02
M004,Man City,Aston Villa,4,2,2024-09-15
M005,Newcastle,West Ham,0,0,2024-09-22
M006,Brighton,Leeds,2,3,2024-09-29

We can use dtype like this:

# Set col data types
col_dtypes = {
    "MatchID": str,
    "HomeGoals": "int32",
    "AwayGoals": "int32"
}

# Load the dataset, specifying data types for MatchID, HomeGoals, and AwayGoals
df9 = pd.read_csv("matches_clean.csv", dtype=col_dtypes)

# View the result
df9.info()

Result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   MatchID    6 non-null      object
 1   HomeTeam   6 non-null      object
 2   AwayTeam   6 non-null      object
 3   HomeGoals  6 non-null      int32
 4   AwayGoals  6 non-null      int32
 5   MatchDate  6 non-null      object
dtypes: int32(2), object(4)
memory usage: 372.0+ bytes
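A common reason to set dtype is to stop pandas from turning identifiers into numbers. A minimal sketch with hypothetical data (SquadNo and Player are made-up columns, not part of the match datasets):

```python
import io

import pandas as pd

# Hypothetical data: SquadNo is an identifier with leading zeros
squad_text = """SquadNo,Player
007,Alex
010,Sam
"""

# Left to infer the type, pandas reads SquadNo as integers
inferred_df = pd.read_csv(io.StringIO(squad_text))

# dtype=str keeps the values exactly as written in the file
typed_df = pd.read_csv(io.StringIO(squad_text), dtype={"SquadNo": str})

print(inferred_df["SquadNo"].tolist())
print(typed_df["SquadNo"].tolist())
```

The leading zeros survive only in typed_df.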

⚡ Summary

In this article, we looked at how to use 9 arguments of read_csv() from pandas to load data in Python:

filepath_or_buffer: the file to load
sep: the delimiter in the file
header: the row to use as the header
skiprows: the rows to skip
nrows: the number of rows to load
usecols: the columns to load
index_col: the column to use as the index
names: the column names
dtype: the data types of the columns

😺 GitHub

The example code and datasets from this article are available on GitHub.

2025-10-30

pandas fundamentals: 5 groups of pd functions worth knowing for working with data, with examples from a Spotify dataset

pandas is a popular Python library for working with data because:

pandas can store data as a table, or data frame
It has functions and methods for working with data frames

In this article, we look at the basics of working with data in pandas.

The pandas functions we cover fall into 5 groups:

No.	Group	Description
1	Exploring	Get a first look at the data
2	Selecting and filtering	Select and filter the data
3	Sorting	Order the data
4	Slicing	Cut out parts of the data
5	Aggregating	Summarise the data

If you're ready, let's get started.

🎧 Dataset: Spotify Tracks

In this article, the example dataset is the Spotify Tracks Dataset from Kaggle.

The Spotify Tracks Dataset contains Spotify tracks across 125 genres and has 20 columns, such as:

track_name: the track title
artists: the artist names
popularity: the popularity score
energy: a measure of intensity (loudness and speed)
liveness: a score for whether the track is a live performance or a studio recording

▶️ Press Play

Before looking at how to use pandas, let's prepare pandas and the dataset:

Install
Import
Read

1️⃣ Install

To use pandas, we first install it:

# Install pandas
!pip install pandas

Note: If pandas is already installed, skip to the next step.

2️⃣ Import

After installing pandas, we load it with import:

# Load pandas
import pandas as pd

Note:

pandas is usually imported under the alias pd to keep code short
This line has to be run at the start of every new session

3️⃣ Read

With pandas installed and imported, we load the dataset. Here it is the Spotify Tracks Dataset, a CSV file, which we load with read_csv() from pandas:

# Load the dataset
spotify = pd.read_csv("spotify_tracks_dataset.csv", index_col=0)

Note: In the Spotify Tracks Dataset, the first column of the CSV is a running number. We use index_col=0 so that pandas uses it as the index instead of loading it as an extra column.

With these 3 steps done, we can start working with the data in pandas.

🎶 Playlist #1 – Exploring

First, let's use pandas to explore the data.

This group covers 3 methods and 1 attribute:

.head()
.info()
.describe()
.shape

1️⃣ .head()

Use case:

View the first 5 rows of the dataset

Example:

# View the first 5 rows
spotify.head()

Result: the first 5 rows of the dataset, shown as a table.

Note:

To view a different number of rows, pass the number we want. For example, spotify.head(10) shows the first 10 rows.

2️⃣ .info()

Use case:

View an overview of the dataset

Example:

# Get overview of the dataset
spotify.info()

Result: a summary listing each column with its non-null count and data type.

3️⃣ .describe()

Use case:

View summary statistics of the dataset:

Count
Mean
Standard deviation (std)
Min
Quartiles
- 25%
- 50% (the median)
- 75%
Max

Example:

# Get summary stats
spotify.describe()

Result: a table of summary statistics for the numerical columns.

Note:

By default, .describe() summarises numerical columns only.

To get summary statistics for categorical variables as well, we can use the argument include="all":

# Get summary stats for all variable types
spotify.describe(include="all")

Result: a table that now covers both numerical variables (such as popularity) and categorical variables (such as artists).

4️⃣ .shape

Use case:

View the number of rows and columns of the dataset

Example:

# See the dimensions of the dataset
spotify.shape

Result:

(114000, 20)

114000 is the number of rows
20 is the number of columns
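The exploring tools can be tried without downloading the dataset. A minimal sketch, assuming a made-up data frame that only borrows 3 column names from the Spotify dataset:

```python
import pandas as pd

# Made-up rows that borrow 3 column names from the Spotify dataset
tracks = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C", "Song D"],
    "artists": ["Band X", "Band Y", "Band X", "Band Z"],
    "popularity": [82, 45, 91, 60],
})

# .shape is an attribute (no parentheses) that returns (rows, columns)
n_rows, n_cols = tracks.shape
print(n_rows, n_cols)

# .head(2) returns the first 2 rows
first_two = tracks.head(2)
print(first_two)

# .describe() summarises the numerical column only
stats = tracks.describe()
print(stats.loc[["mean", "max"], "popularity"])
```

Because .shape is a tuple, it can be unpacked into two variables as shown.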

🎶 Playlist #2 – Selecting & Filtering

In the second group, we look at how to select and filter data:

df[condition]
.query()

1️⃣ df[condition]

Use case:

df[condition] is the syntax for filtering rows by a condition

Example:

We want to see tracks with a popularity score above 80:

# Select records where popularity is greater than 80
spotify[spotify["popularity"] > 80]

Result: the rows with popularity above 80.

Note:

When filtering, we can use these comparison operators:

Comparison Operator	Meaning
`==`	equal to
`!=`	not equal to
`>`	greater than
`>=`	greater than or equal to
`<`	less than
`<=`	less than or equal to

We can also use Boolean operators to combine conditions:

Boolean Operator	Meaning
`&`	and
`|`	or
`~`	not

Each condition has to be wrapped in parentheses, because these operators bind more tightly than the comparison operators.

For example, to see tracks with a popularity score above 80 by the band The Neighbourhood (from artists):

# Select records where popularity is greater than 80 from The Neighbourhood
spotify[(spotify["popularity"] > 80) & (spotify["artists"] == "The Neighbourhood")]

Result: the rows that meet both conditions.
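The 3 Boolean operators can be compared on a small frame. A minimal sketch, assuming made-up rows that borrow column names from the Spotify dataset:

```python
import pandas as pd

tracks = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C", "Song D", "Song E"],
    "artists": ["Band X", "Band Y", "Band X", "Band Z", "Band X"],
    "popularity": [82, 45, 91, 60, 30],
})

# Each condition is a Series of True/False values, one per row
popular = tracks["popularity"] > 80
by_band_x = tracks["artists"] == "Band X"

both = tracks[popular & by_band_x]     # and: popular tracks by Band X
either = tracks[popular | by_band_x]   # or: popular, or by Band X
not_popular = tracks[~popular]         # not: everything that is not popular

print(both["track_name"].tolist())
print(either["track_name"].tolist())
print(not_popular["track_name"].tolist())
```

Storing each condition in a named variable first keeps longer filters readable.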

2️⃣ .query()

Use case:

.query() works like df[condition]: it filters rows.

But .query() has 2 advantages:

It is easy to use
It suits filtering with complex conditions

The input to .query() is a string expression that reads much like a SQL WHERE clause, with column names written directly and and/or/not as operators.

Example:

In the earlier example, we wanted tracks that:

Have a popularity score above 80
Are by the band The Neighbourhood (from artists)

We can write this with .query() as follows:

# Filter with .query()
spotify.query("popularity > 80 and artists == 'The Neighbourhood'")

Result: the same rows as with df[condition].

Note:

If we compare df[condition] and .query():

df[condition]	.query()
`spotify[(spotify["popularity"] > 80) & (spotify["artists"] == "The Neighbourhood")]`	`spotify.query("popularity > 80 and artists == 'The Neighbourhood'")`

We can see that .query() is:

Shorter
Easier to read
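.query() can also refer to Python variables by prefixing their names with @, which avoids building the string by hand. A minimal sketch, assuming made-up rows and hypothetical variable names:

```python
import pandas as pd

tracks = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C", "Song D"],
    "artists": ["Band X", "Band Y", "Band X", "Band Z"],
    "popularity": [82, 45, 91, 60],
})

# Values kept in ordinary Python variables
min_popularity = 80
band = "Band X"

# @name inserts the value of the variable into the expression
result = tracks.query("popularity > @min_popularity and artists == @band")

print(result["track_name"].tolist())
```

This keeps the condition text fixed while the threshold and the band name can change.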

🎶 Playlist #3 – Sorting

After filtering, we often want to order the data to make it easier to understand:

.sort_values()

1️⃣ .sort_values()

Use case:

Sort the data:

By default, values are sorted from low to high (A-Z)
To sort from high to low (Z-A), use ascending=False

Example:

We want to sort tracks by popularity score from high to low to find the hits:

# Sort tracks by popularity in descending order
spotify.sort_values(by="popularity", ascending=False)

Result: the dataset sorted with the most popular tracks first.
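.sort_values() also accepts a list of columns, with one sort direction per column. A minimal sketch, assuming made-up rows that borrow column names from the Spotify dataset:

```python
import pandas as pd

tracks = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C", "Song D"],
    "artists": ["Band X", "Band Y", "Band X", "Band Z"],
    "popularity": [82, 45, 91, 60],
})

# Sort by artist A-Z, then by popularity from high to low within each artist
ranked = tracks.sort_values(by=["artists", "popularity"], ascending=[True, False])

print(ranked["track_name"].tolist())

# sort_values() returns a new data frame; the original order is unchanged
print(tracks["track_name"].tolist())
```

To keep the sorted order, assign the result to a variable as done here.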

🎶 Playlist #4 – Slicing

In this group, we look at 4 ways to pull rows and/or columns out of a dataset:

df[column_name]
.loc[]
.iloc[]
.filter()

1️⃣ df[column_name]

Use case:

df[column_name] is the syntax for selecting a column, where:

df is the name of the dataset
column_name is the name of the column to select

Example:

Select the track titles (track_name):

# Select column track_name
spotify["track_name"]

Result: the track_name column.

Note:

To select more than 1 column, we can pass a list.

For example, select the track title (track_name) and popularity score (popularity):

# Select columns track_name and popularity
spotify[["track_name", "popularity"]]

Result: a data frame with the 2 selected columns.

2️⃣ .loc[]

Use case:

Select rows and/or columns by their labels (label-based)

Syntax:

.loc[] is used as follows:

Syntax	For
`df.loc[r_lab]`	Select 1 row
`df.loc[rx:ry]`	Select a range of rows
`df.loc[[r_list]]`	Select a list of rows
`df.loc[:, c_lab]`	Select 1 column
`df.loc[:, cx:cy]`	Select a range of columns
`df.loc[:, [c_list]]`	Select a list of columns

r_lab is a row label
rx:ry is the range of row labels to select (both ends are included)
[r_list] is a list of row labels to select
c_lab is the label of the column to select
cx:cy is the range of column labels to select (both ends are included)
[c_list] is a list of column labels to select

Example:

Select the first 5 rows and show only:

Track title (track_name)
Artist name (artists)
Popularity score (popularity)

# Select first 5 rows from track_name, artists, popularity
spotify.loc[0:4, ["track_name", "artists", "popularity"]]

Result: the first 5 rows (labels 0 to 4) with the 3 selected columns.

3️⃣ .iloc[]

Use case:

Select rows and/or columns by their positions (position-based)

Syntax:

.iloc[] is used much like .loc[]:

Syntax	For
`df.iloc[r_index]`	Select 1 row
`df.iloc[rx:ry]`	Select a range of rows
`df.iloc[[r_list]]`	Select a list of rows
`df.iloc[:, c_index]`	Select 1 column
`df.iloc[:, cx:cy]`	Select a range of columns
`df.iloc[:, [c_list]]`	Select a list of columns

r_index is a row position
rx:ry is the range of row positions to select (ry is excluded)
[r_list] is a list of row positions to select
c_index is the position of the column to select
cx:cy is the range of column positions to select (cy is excluded)
[c_list] is a list of column positions to select

The difference between .loc[] and .iloc[] is what they use to select rows and columns:

.loc[] uses labels
.iloc[] uses positions

Example:

Select the first 5 rows and show only:

Track title (track_name)
Artist name (artists)
Popularity score (popularity)

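# Select first 5 rows from track_name, artists, popularity
# Positions assume the Kaggle CSV's column order after index_col=0:
# track_id (0), artists (1), album_name (2), track_name (3), popularity (4)
spotify.iloc[0:5, [3, 1, 4]]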

Result: the first 5 rows of the columns at the selected positions.
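The difference between labels and positions is easier to see when the index is not 0, 1, 2, and so on. A minimal sketch, assuming made-up rows with hypothetical index labels 10 to 40:

```python
import pandas as pd

# Made-up rows with index labels that differ from their positions
tracks = pd.DataFrame(
    {
        "track_name": ["Song A", "Song B", "Song C", "Song D"],
        "artists": ["Band X", "Band Y", "Band X", "Band Z"],
        "popularity": [82, 45, 91, 60],
    },
    index=[10, 20, 30, 40],
)

# .loc[] uses labels, and the end label (30) is included
by_label = tracks.loc[10:30, ["track_name", "popularity"]]

# .iloc[] uses positions, and the end position (3) is excluded
by_position = tracks.iloc[0:3, [0, 2]]

# Column positions can be looked up from names instead of counted by hand
positions = tracks.columns.get_indexer(["track_name", "popularity"])

print(by_label.equals(by_position))
print(positions.tolist())
```

Looking positions up with get_indexer() avoids miscounting columns in a wide dataset.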

4️⃣ .filter()

Use case:

.filter() selects rows or columns by their labels. Unlike df[condition], it does not look at the values in the data; it matches on index labels or column names.

Syntax:

df.filter(condition, axis)

df is the name of the dataset
condition is how labels are matched, using 1 of 3 parameters:
- items: keep the labels listed
- like: keep labels that contain a given substring
- regex: keep labels that match a regular expression
axis sets whether to match on rows (0) or columns (1)

Example:

Select rows whose index label contains "123":

# Select rows with "123"
spotify.filter(like="123", axis=0)

Result: the rows whose index label contains "123".

Or select these columns:

Track title (track_name)
Artist name (artists)
Popularity score (popularity)

# Select the columns track_name, artists, popularity
spotify.filter(items=["track_name", "artists", "popularity"])

Result: a data frame with the 3 selected columns.
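The like and regex parameters are most useful on column names. A minimal sketch, assuming made-up rows that borrow 4 column names from the Spotify dataset:

```python
import pandas as pd

tracks = pd.DataFrame({
    "track_id": ["t1", "t2"],
    "track_name": ["Song A", "Song B"],
    "popularity": [82, 45],
    "duration_ms": [200000, 180000],
})

# like: keep columns whose name contains "track"
track_cols = tracks.filter(like="track", axis=1)

# regex: keep columns whose name ends with "_ms"
ms_cols = tracks.filter(regex="_ms$", axis=1)

# .filter() matches labels, not values, so "Song" finds no columns
no_match = tracks.filter(like="Song", axis=1)

print(track_cols.columns.tolist())
print(ms_cols.columns.tolist())
print(no_match.shape)
```

To filter on the values in a column, df[condition] or .query() is the tool to use.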

🎶 Playlist #5 – Aggregating

Finally, let's look at how to summarise data:

Aggregation functions
.agg()
.groupby()

1️⃣ Aggregation Functions

When we want to calculate statistics, pandas offers many functions, such as:

Function	Meaning
`.sum()`	Sum
`.mean()`	Mean
`.median()`	Median
`.mode()`	Most frequent value
`.min()`	Minimum
`.max()`	Maximum
`.std()`	Standard deviation (SD)
`.cumsum()`	Cumulative sum
`.value_counts()`	Count of each value
`.nunique()`	Number of unique values

Example:

We want the mean of the popularity score (popularity):

# Calculate the mean of popularity
spotify["popularity"].mean()

Result: the mean of popularity as a single number.

Or the SD of the popularity score (popularity):

# Calculate the SD of popularity
spotify["popularity"].std()

Result: the SD of popularity as a single number.

2️⃣ .agg()

Use case:

Sometimes we want to calculate several statistics at once, as in the previous example where we wanted both the mean and the SD.

Instead of writing separate lines of code for each result, such as:

# Calculate mean
spotify["popularity"].mean()

# Calculate SD
spotify["popularity"].std()

we can use .agg() to save time.

Example:

Find the mean and SD of the popularity score (popularity):

# Calculate mean and SD of popularity
spotify["popularity"].agg(["mean", "std"])

Result: the mean and std of popularity, returned together.

Note:

We can use .agg() to calculate statistics for several columns at once, for example to find:

mean
std

for:

Popularity score
Track duration

# Calculate mean and SD for popularity and duration_ms
spotify[["popularity", "duration_ms"]].agg({
    "popularity": ["mean", "std"],
    "duration_ms": ["mean", "std"]
})

Result: a table with one row per statistic and one column per variable.
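The dictionary form of .agg() also allows different statistics per column. A minimal sketch, assuming made-up rows that borrow column names from the Spotify dataset:

```python
import pandas as pd

tracks = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C", "Song D"],
    "popularity": [82, 45, 91, 60],
    "duration_ms": [200000, 180000, 240000, 210000],
})

# Ask for mean and max of popularity, but only min of duration_ms
summary = tracks[["popularity", "duration_ms"]].agg({
    "popularity": ["mean", "max"],
    "duration_ms": ["min"],
})

print(summary)
```

Statistics that were not requested for a column show up as NaN in the result.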

3️⃣ .groupby()

Use case:

Sometimes we want to calculate statistics for each group in the data.

We can use .groupby() to group the data before calculating the statistics.

Example:

We want the mean and SD of the popularity score (popularity) for each artist:

# Group by artists and calculate the mean and SD of popularity
spotify.groupby("artists")["popularity"].agg(["mean", "std"])

Result: a table with one row per artist and the columns mean and std.
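The grouped result is itself a data frame, so it can be sorted like any other. A minimal sketch, assuming made-up rows that borrow column names from the Spotify dataset:

```python
import pandas as pd

tracks = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C", "Song D"],
    "artists": ["Band X", "Band Y", "Band X", "Band Z"],
    "popularity": [82, 45, 91, 60],
})

# Mean, SD, and number of tracks per artist, most popular artist first
by_artist = (
    tracks.groupby("artists")["popularity"]
    .agg(["mean", "std", "count"])
    .sort_values("mean", ascending=False)
)

print(by_artist)
```

Artists with a single track get NaN for std, because a standard deviation needs at least 2 values. Adding count makes such cases easy to spot.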

⏭️ Next Song

💻 Example Code

If you want to run the code yourself, the example code is available on GitHub.


2025-03-27
