Mathematics of Support Vector Machine

2019-05-06
Jun Sok Huhh | ๐Ÿ lostineconomics.com

Key Questions

Key Synopsis

SVM Mathematically

Preliminary concepts

Length of a vector

๋ฒกํ„ฐ x=(x1,x2,โ€ฆ,xn){\bf x} = (x_1, x_2, \dotsc, x_n) ์˜ ๋ฒกํ„ฐ์˜ ๊ธธ์ด, ์ฆ‰ ์œ ํด๋ฆฌ๋“œ ๋…ธ๋ฆ„(norm), ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋œ๋‹ค.

โˆฅxโˆฅ=x12+x22+โ€ฆ+xn2 \Vert {\bf x} \Vert = \sqrt{x_1^2 + x_2^2 + \dotsc + x_n^2}

Direction of a vector

๋ฒกํ„ฐ x\bf x์˜ ๋ฐฉํ–ฅ์„ฑ w\bf w๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

w=(x1โˆฅxโˆฅ,x2โˆฅxโˆฅ) \mathbf{w} = \left( \dfrac{x_1}{ \Vert \bf x \Vert }, \dfrac{x_2}{ \Vert \bf x \Vert } \right)

๊ทธ๋ฆผ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ณด์ž.

์ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์‚ผ๊ฐํ•จ์ˆ˜๋กœ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋‹ค.
w=(cosโก ฮธ,cosโก ฮฑ) \mathbf{w} = \left( \cos~\theta, \cos~\alpha \right)

Dot product (inner product)

๋‚ด์ ์€ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์˜ ์ผ์ข…์œผ๋กœ, ์ด๋Š” ๋‘ ๋ฒกํ„ฐ๋ฅผ ์Šค์นผ๋ผ ๊ฐ’์œผ๋กœ ๋ฐ”๊ฟ”์ฃผ๋Š” ์ผ์ข…์˜ ํ•จ์ˆ˜๋‹ค.

cosโกฮธ=cosโก(ฮฒโˆ’ฮฑ)=cosโกฮฒcosโกฮฑ+sinโกฮฒ sinโกฮฑ=x1โˆฅxโˆฅy1โˆฅyโˆฅ+x2โˆฅxโˆฅy2โˆฅyโˆฅ=x1y1+x2y2โˆฅxโˆฅโˆฅyโˆฅ \begin{aligned} \cos \theta & = \cos (\beta -\alpha) \\\\ & = \cos \beta \cos \alpha + \sin \beta \ \sin \alpha \\\\ & = \dfrac{x_1}{\Vert \rm x \Vert} \dfrac{y_1}{\Vert \rm y \Vert} + \dfrac{x_2}{\Vert \rm x \Vert} \dfrac{y_2}{\Vert \rm y \Vert} \\\\ & = \dfrac{x_1 y_1 + x_2 y_2}{\Vert \rm x \Vert \Vert \rm y \Vert} \end{aligned}

์ด๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์ •๋ฆฌํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

xโ‹…y=โˆฅxโˆฅโˆฅyโˆฅcosโกฮธ \rm x \cdot \rm y = \Vert \rm x \Vert \Vert \rm y \Vert \cos \theta

Hyperplane

nn ์ฐจ์› ๊ณต๊ฐ„์„ ๊ฐ€๋ฅผ ์ˆ˜ ์žˆ๋Š” ํ•ด๋‹น ๊ณต๊ฐ„์˜ ์ฐจ์›๋ณด๋‹ค ํ•˜๋‚˜ ๋‚ฎ์€ ์ˆ˜ํ•™์  ๊ด€๊ณ„๋ผ๊ณ  ํ’€์–ด์„œ ์“ธ ์ˆ˜ ์žˆ๋‹ค.

x=(x1,x2){\bf x} = (x_1, x_2)์˜ ๋ฒกํ„ฐ๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•  ๋•Œ, ํ•˜์ดํผํ”Œ๋ ˆ์ธ์€ ๋ฒกํ„ฐ w\bf w์™€ bb์— ์˜ํ•ด ์ •์˜๋œ๋‹ค. ์ฆ‰,

wโ‹…x+b=0 \mathbf{w} \cdot {\bf x} + b = 0

Classifier

ํ•˜์ดํผํ”Œ๋ ˆ์ธ์„ ๊ธฐ์ค€์œผ๋กœ ํด๋ž˜์‹œํŒŒ์ด์–ด๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ํ•œ๋‹ค. ํŠน์ •ํ•œ ๊ด€์ฐฐ ๋ฒกํ„ฐ x\bf x๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•˜์ž. ์ด๋•Œ ๋ถ„๋ฅ˜ hh์˜ ์ •์˜๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

h(x)={+1ifwโ‹…x+bโ‰ฅ0โˆ’1ifwโ‹…x+b<0 h({\bf x}) = \begin{cases} +1\hspace{3em} & \text{if} \hspace{1em} \mathbf{w} \cdot {\bf x} + b \geq 0 \\\\ -1 \hspace{3em} & \text{if} \hspace{1em} \mathbf{w} \cdot {\bf x} + b < 0 \end{cases}

Explained Visually

๊ทธ๋ฆผ์œผ๋กœ ๋ณด๋‹ค ์ง๊ด€์ ์œผ๋กœ ์ดํ•ดํ•ด๋ณด์ž.

์–ด๋–ค ์›์ ์„ ๊ธฐ์ค€์œผ๋กœ training example๊นŒ์ง€์˜ ๋ฒกํ„ฐ๋ฅผ xi{\bf x}_i๋ผ๊ณ  ํ•˜์ž. ์ด๋•Œ ๋‘˜์„ ๊ฐ€๋ฅด๋Š” ํ•˜์ดํผํ”Œ๋ ˆ์ธ์ด ์žˆ์„ ๋•Œ ์ด์™€ ์ง๊ตํ•˜๋Š” ๋ฒกํ„ฐ (orthogonal vector) $\mathbf{w} $๋ฅผ ์ƒ๊ฐํ•ด๋ณด์ž. ์™œ orthogonalํ•ด์•ผ ํ•˜๋Š”๊ฐ€? ์ž ์‹œ ํ›„ ๊ทธ ์ด์œ ๋ฅผ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ดํผํ”Œ๋ ˆ์ธ์€ ๊ธฐ๋ณธ์ ์œผ๋กœ๋Š” ๋‘ ๋ฒกํ„ฐ ์‚ฌ์ด์˜ ๋‹ท ํ”„๋กœ๋•ํŠธ๋‹ค1. ๋‹ท ํ”„๋กœ๋•ํŠธ๋ฅผ ๊ทธ๋ฆผ์œผ๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์€ ์ด๋ฅผ projection์œผ๋กœ ์ƒ๊ฐํ•ด๋ณด๋Š” ๊ฒƒ์ด๋‹ค.

์ฆ‰, xi{\bf x}_i๋ฅผ $\mathbf{w} $๋กœ ํ”„๋กœ์ ์…˜์„ ํ•œ๋‹ค๋ฉด(projection of xi{\bf x}_i on $\mathbf{w} $), ์ด๋Š”

Projwxi=wโ‹…xiโˆฅwโˆฅ \text{Proj}_\mathbf{w} {\bf x}_i = \dfrac{\mathbf{w} \cdot{\bf x}_i}{\Vert \bf w \Vert}

๋‹ท ํ”„๋กœ๋•ํŠธ์˜ ๋ถ€๋ถ„์ด ์‹œ๊ฐ์ ์œผ๋กœ๋Š” projection ๊ฒฐ๊ณผ ๊ณฑํ•˜๊ธฐ โˆฅwโˆฅ\Vert \bf w \Vert๋กœ ๋‚˜ํƒ€๋‚œ๋‹ค. ์ฆ‰, xi{\bf x}_i์—์„œ w\bf w๋ฅผ ํ–ฅํ•ด ๋‚ด๋ฆฐ ์„ ๋ถ„์ด ํ”„๋กœ์ ์…˜์ด๊ณ  ์ด๋ฅผ โˆฅwโˆฅ\Vert \bf w \Vert๋กœ ์Šค์ผ€์ผ๋ง ํ•œ w\bf w ์œ„์—์„œ์˜ ๊ธธ์ด๊ฐ€ ๋‹ท ํ”„๋กœ๋•ํŠธ๋ฅผ ์‹œ๊ฐ์ ์œผ๋กœ ๋‚˜ํƒ€๋‚ธ ๊ฒƒ์ด๋‹ค. ์ด ํ”„๋กœ์ ์…˜์˜ ๊ธธ์ด์— ๋”ฐ๋ผ์„œ ํ•ด๋‹น ํŠธ๋ ˆ์ด๋‹ ์ƒ˜ํ”Œ์ด ์–ด๋–ค ๊ฒƒ์œผ๋กœ ๋ถ„๋ฅ˜๋ ์ง€์— ๊ด€ํ•ด์„œ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋‹ค. โˆฅwโˆฅ\bf \Vert w \Vert๊ฐ€ ๊ณ ์ •๋˜์–ด ์žˆ๋‹ค๊ณ  ํ•˜๋ฉด, ํ”„๋กœ์ ์…˜์˜ ํฌ๊ธฐ๊ฐ€ ์ผ์ • ์ˆซ์ž๋ณด๋‹ค ํฌ๋ฉด ๋ถ„๋ฅ˜์˜ ์˜ค๋ฅธ์ชฝ์— ์ž‘์œผ๋ฉด ๋ถ„๋ฅ˜์˜ ์™ผ์ชฝ์— ์œ„์น˜ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œ์‹œํ•ด๋ณด์ž.

wโ‹…xr+bโ‰ฅ1\mathbf{w} \cdot {\bf x}_{\mathrm r} + b \geq 1

wโ‹…xl+bโ‰ค1\mathbf{w} \cdot {\bf x}_{\mathrm l} + b \leq 1

ํ”„๋กœ์ ์…˜์˜ ๊ธธ์ด๊ฐ€ ์ผ์ •ํ•œ ๊ธฐ์ค€๋ณด๋‹ค ๊ธธ๋ฉด ์˜ค๋ฅธ์ชฝ์— ์งง์œผ๋ฉด ์™ผ์ชฝ์— ์œ„์น˜ํ•œ ๊ฒƒ์œผ๋กœ ๋ถ„๋ฅ˜ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด ์กฐ๊ฑด์„ yiy_i์™€ ํ•จ๊ป˜ ๋‚˜ํƒ€๋‚ด๋ณด์ž. ์ฆ‰,

yi(wโ‹…xi+b)โˆ’1โ‰ฅ0 y_i ( \mathbf{w} \cdot {\bf x}_ i + b) - 1 \geq 0

์•ž์„œ ๋ถ„๋ฅ˜๊ธฐ์—์„œ ํ•ด๋‹น ๊ฐ’์ด 0๋ณด๋‹ค ํฌ๋ฉด yi(wโ‹…xi+b)โˆ’1โ‰ฅ0y_i ( \mathbf{w} \cdot {\bf x}_ i + b) - 1 \geq 0๊ฐ€ ์„ฑ๋ฆฝํ•œ๋‹ค. ๋ฐ˜๋ฉด, ํ•ด๋‹น ๊ฐ’์ด 0๋ณด๋‹ค ์ž‘์œผ๋ฉด ์Œ์ˆ˜๋ฅผ ๊ณฑํ•˜๋Š” ๊ฒƒ์ด ๋˜์–ด ๋ถ€๋“ฑํ˜ธ๊ฐ€ ๋ฐ”๋€Œ๊ฒŒ ๋˜๊ณ , ์ด ๊ฒฝ์šฐ ์—ญ์‹œ ์œ„์˜ ์‹์ด ์„ฑ๋ฆฝํ•œ๋‹ค.

์ด์ œ cosโกฮธ\cos \theta๋ฅผ ๋ฒกํ„ฐ xsvrโˆ’xsvl{\bf x}_{\rm svr} - {\bf x} _{\rm svl}์™€ w\mathbf{w}๊ฐ€ ์ด๋ฃจ๋Š” ๊ฐ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜์ž. ์ด๋•Œ w\mathbf{w}๋Š” ํ•˜์ดํผํ”Œ๋ ˆ์ธ๊ณผ orthogonalํ•˜๋ฉฐ ์ ์ ˆํ•œ training sample ์ฆ‰, ์ ์ ˆํ•œ ํ•˜๋‚˜์˜ ์„œํฌํŠธ ๋ฒกํ„ฐ๋ฅผ ์ง€๋‚œ๋‹ค. ์ด๋•Œ cosโกฮธ\cos \theta๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์‰ฝ๊ฒŒ ์ •์˜๋œ๋‹ค.2

cosโกฮธ=(xsvrโˆ’xsvl)โ‹…wโˆฅxsvrโˆ’xsvlโˆฅโˆฅwโˆฅ\cos \theta = \dfrac{({\bf x} _ {\rm svr} - {\bf x} _ {\rm svl}) \cdot \bf w}{\Vert {\bf x} _ {\rm svr} - {\bf x} _ {\rm svl} \Vert \Vert \mathbf{w} \Vert}

ํ•œํŽธ, ํ•˜์ดํผํ”Œ๋ ˆ์ธ๊ณผ ํ‰ํ–‰ํ•˜๋ฉด์„œ ์„œํฌํŠธ ๋ฒกํ„ฐ๋ฅผ ์ง€๋‚˜๊ฐ€๋Š” ํ•˜์ดํผํ”Œ๋ ˆ์ธ์˜ ๊ฑฐ๋ฆฌ ฮ”x\Delta_{\bf x}๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

ฮ”xโˆฅxsvrโˆ’xsvlโˆฅ=cosโกฮธ=(xsvrโˆ’xsvl)โ‹…wโˆฅxsvrโˆ’xsvlโˆฅโˆฅwโˆฅ\dfrac{ \Delta _ {\bf x} }{\Vert {\bf x} _ {\rm svr} - {\bf x} _ {\rm svl} \Vert } = \cos \theta = \dfrac{({\bf x} _ {\rm svr} - {\bf x} _ {\rm svl}) \cdot \bf w}{\Vert {\bf x} _ {\rm svr} - {\bf x} _ {\rm svl} \Vert \Vert \mathbf{w} \Vert}

๋”ฐ๋ผ์„œ

ฮ”x=(xsvrโˆ’xsvl)โ‹…wโˆฅwโˆฅ\Delta _ {\bf x} = \dfrac{({\bf x} _ {\rm svr} - {\bf x} _ {\rm svl}) \cdot \bf w}{\Vert \mathbf{w} \Vert}

yi(wโ‹…xi+b)โˆ’1=0y_i ( \mathbf{w} \cdot {\bf x}_ i + b) - 1 = 0์˜ ์–‘๋ณ€์— yiy_i๋ฅผ ๊ณฑํ•˜๋ฉด, yi2(wโ‹…xi+b)=yiy_i^2 ( \mathbf{w} \cdot {\bf x}_ i + b) = y_i๊ฐ€ ๋œ๋‹ค. yi2=1y_i^2 =1์ด๋ฏ€๋กœ,

xsvrโ‹…w+b=1xsvlโ‹…w+b=โˆ’1 \begin{aligned} {\bf x}_ {\rm svr} \cdot \mathbf{w} + b & = 1 \\ {\bf x}_ {\rm svl} \cdot \mathbf{w} + b & = -1 \end{aligned}

์—ฌ๊ธฐ์„œ (xsvrโˆ’xsvl)โ‹…w=2({\bf x} _ {\rm svr} - {\bf x} _ {\rm svl}) \cdot \mathbf{w} = 2๋ฅผ ์‰ฝ๊ฒŒ ๋„์ถœํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ฒฐ๋ก ์ ์œผ๋กœ ๋‘ ์„œํฌํŠธ ๋ฒกํ„ฐ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ฌธ์ œ๋Š” โˆฅwโˆฅ\Vert \bf w \Vert๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฌธ์ œ์™€ ๊ฐ™๋‹ค.

Optimization for SVM

Metrics to compare hyperplanes

Defining functional margin

fi=yi(wโ‹…xi+b)f_i = y_i(\mathbf{w} \cdot {\bf x}_i + b)๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•˜์ž. ์ด๋•Œ ๋ถ„๋ฅ˜๊ฐ€ ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์กŒ๋‹ค๋ฉด, fif_i์˜ ๋ถ€ํ˜ธ๋Š” ์–ธ์ œ๋‚˜ ์–‘์ˆ˜๋‹ค. ์œ„์˜ ๋ถ„๋ฅ˜์˜ ์ •์˜์— ๋”ฐ๋ฅด๋ฉด ๊ทธ๋ ‡๋‹ค. ๋ฐ์ดํ„ฐ ์…‹ DD์˜ ์ •์˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

D={(xi,yi)โˆฃxiโˆˆRn, yiโˆˆ{โˆ’1,1}}i=1m D = \left\lbrace ({\bf x}_i, y_i) \mid {\bf x}_ i \in \mathbb R^n,~y_ i \in \lbrace -1, 1\rbrace \right\rbrace_{i=1}^m

ํŽ‘์…”๋„ ๋งˆ์ง„(functional margin)์ด๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” FF ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

F=minโกi=1,โ€ฆ,myi(wโ‹…xi+b) F = \min_{i = 1, \dotsc, m} y_i( \mathbf{w} \cdot {\bf x} _i + b )

w\mathbf{w}์™€ bb๋กœ ์ •์˜๋˜๋Š” ํ•˜์ดํผํ”Œ๋ ˆ์ธ์ด ๋ชจ๋“  ํŠธ๋ ˆ์ด๋‹ ์…‹์„ ์ž˜ ๋ถ„๋ฅ˜ํ–ˆ๋‹ค๋ฉด, fi>0f_i > 0๊ฐ€ ์„ฑ๋ฆฝํ•œ๋‹ค. ์ด fif_i ์ค‘ ๊ฐ€์žฅ ์ž‘์€ ๊ฐ’์ด functional margin์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋‘ ๋ฒˆ์งธ๋กœ ์„œ๋กœ ๋‹ค๋ฅธ ํ•˜์ดํผํ”Œ๋ ˆ์ธ ์ค‘์—์„œ ๊ฐ€์žฅ ํฐ FF๋ฅผ ์ง€๋‹ˆ๋Š” ํ•˜์ดํผํ”Œ๋ ˆ์ธ์ด ์ตœ์ ์ด ํ•˜์ดํผํ”Œ๋ ˆ์ธ์ด๋‹ค.

  1. FF๋ฅผ ์–ป๊ธฐ ์œ„ํ•œ ๊ณผ์ •์—์„œ ์ตœ์†Œํ™” ๋กœ์ง์ด ๋“ค์–ด๊ฐ„๋‹ค. ์ฆ‰, ํ•ด๋‹น ํ•˜์ดํผํ”Œ๋ ˆ์ธ๊ณผ ๊ฐ€์žฅ ๊ฐ€๊น๊ฒŒ ์œ„์น˜ํ•œ ๊ด€์ธก์น˜๋ฅผ ์–ป์–ด๋‚ด๋Š” ๊ณผ์ •
  2. ์ด๋ ‡๊ฒŒ ์–ป์–ด๋‚ธ FF๋“ค์„ ์„œ๋กœ ๋‹ค๋ฅธ ํ•˜์ดํผํ”Œ๋ ˆ์ธ๋“ค ์‚ฌ์ด์— ๋น„๊ตํ•˜๊ณ , ๊ฐ€์žฅ ํฐ FF๋ฅผ ์ฃผ๋Š” ํ•˜์ดํผํ”Œ๋ ˆ์ธ์„ ์ฑ„ํƒํ•œ๋‹ค.

ํ‘œ์ค€ํ™”๋ฅผ ์œ„ํ•ด์„œ w\bf w์˜ norm์œผ๋กœ fif_i ๊ฐ’์„ ๋‚˜๋ˆ„๊ณ  ๊ทน๋Œ€ํ™” ๋ฌธ์ œ๋ฅผ ์ •์‹ํ™”ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Derivation of SVM optimization problem

ํ‘œ์ค€ํ™”๋ฅผ ์œ„ํ•ด์„œ โˆฅwโˆฅ\Vert \bf w \Vert๋กœ ๋ชฉ์ ํ•จ์ˆ˜์™€ ์ œ์•ฝ์„ ๋‚˜๋ˆ„์ž.

maxโกw,bMs.t.ฮณiโ‰ฅMfori=1,โ€ฆ,m \max_{\mathbf{w} , b} M\hspace{1em}\text{s.t.}\hspace{1em}\gamma_i \geq M\hspace{1em}\text{for}\hspace{1em}i = 1,\dotsc, m

where

ฮณi=yi(wโˆฅwโˆฅโ‹…xi+bโˆฅwโˆฅ) \gamma_i = y_i \left( \dfrac{\mathbf{w} }{\Vert \mathbf{w} \Vert} \cdot {\bf x}_i + \dfrac{b}{\Vert \mathbf{w} \Vert} \right)

M=minโกi=1,โ€ฆ,mฮณi M = \min_{i=1, \dotsc, m} \gamma_i

ํ‘œ์ค€ํ™”๋œ ํŽ‘์…”๋„ ๋งˆ์ง„์„ ์ตœ๋Œ€ํ™”ํ•˜๋˜, ํŠธ๋ ˆ์ด๋‹ ์ƒ˜ํ”Œ๋“ค์ด ์ด๊ฒƒ๋ณด๋‹ค ์ปค์•ผ ํ•œ๋‹ค๋Š” ์กฐ๊ฑด(์ตœ์†Œํ™”)์ด ์ œ์•ฝ์œผ๋กœ ๋“ค์–ด๊ฐ„๋‹ค. ์ฆ‰, ์•„๋ž˜์˜ ์‹์€ ์ตœ์†Œํ™” ์ œ์•ฝ ํ•˜์—์„œ FF๋ฅผ ์ตœ๋Œ€ํ™”ํ•œ๋‹ค๋Š” ์ด์ค‘ ์ตœ์ ํ™” ๊ณผ์ •์„ ๋ณด์—ฌ ์ค€๋‹ค.

maxโกw,bFโˆฅwโˆฅs.t.fiโ‰ฅFfori=1,2,โ€ฆ,m \max_{\mathbf{w} , b} \dfrac{F}{\Vert w \Vert}\hspace{1em}\text{s.t.}\hspace{1em}f_i \geq F \hspace{1em}\text{for}\hspace{1em}i = 1,2, \dotsc, m

์œ„ ๊ทน๋Œ€ํ™” ๋ฌธ์ œ์—์„œ ๋ชจ๋“  ๋ณ€์ˆ˜๋Š” ์ƒ๋Œ€๊ฐ’์œผ๋กœ ์ •์˜ํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ FF๋ฅผ 1๋กœ ์ œํ•œํ•ด๋„ ํ•ด๋Š” ๋ฐ”๋€Œ์ง€ ์•Š๋Š”๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์•„๋ž˜์™€ ๊ฐ™์€ ์ฐจ๋ก€๋กœ ์ •์‹ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค.

maxโกw,b1โˆฅwโˆฅs.t.fiโ‰ฅ1fori=1,2,โ€ฆ,m \max_{\mathbf{w} , b} \dfrac{1}{\Vert w \Vert}\hspace{1em}\text{s.t.}\hspace{1em}f_i \geq 1\hspace{1em}\text{for}\hspace{1em}i = 1,2, \dotsc, m

minโกw,bโˆฅwโˆฅs.t.fiโ‰ฅ1fori=1,2,โ€ฆ,m \min_{\mathbf{w} , b} {\Vert w \Vert}\hspace{1em}\text{s.t.}\hspace{1em}f_i \geq 1\hspace{1em}\text{for}\hspace{1em}i = 1,2, \dotsc, m

minโกw,b12โˆฅwโˆฅ2s.t.fiโ‰ฅ1fori=1,2,โ€ฆ,m \min_{\mathbf{w} , b} \dfrac{1}{2}{\Vert w \Vert}^2\hspace{1em}\text{s.t.}\hspace{1em}f_i \geq 1\hspace{1em}\text{for}\hspace{1em}i = 1,2, \dotsc, m

Optimization by Wolfe duality

์ œ์•ฝ ํ•˜์˜ ๊ทน๋Œ€ํ™” ๋ฌธ์ œ์ด๋ฏ€๋กœ ๋ผ๊ทธ๋ž‘์ฃผ ์ตœ์ ํ™”๋กœ ๋ฐ”๋€Œ์„œ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ผ๊ทธ๋ž‘์ฃผ ๋ฐฉ์ •์‹์„ ์ •์˜ํ•˜์ž.

L(w,b,ฮฑ)=12wโ‹…wโˆ’โˆ‘i=1mฮฑi[yi(wโ‹…x+b)โˆ’1] {\mathcal L}(\mathbf{w} , b, {\boldsymbol \alpha}) = \frac{1}{2} \mathbf{w} \cdot \mathbf{w} - \sum_{i=1}^m \alpha_i \left [ y_i (\mathbf{w} \cdot {\bf x} + b) -1 \right]

์—ฌ๊ธฐ์„œ ๋ฒกํ„ฐ ฮฑ\boldsymbol \alpha๋Š” ๋ผ๊ทธ๋ž‘์ฃผ ์ตœ์ ํ™”์˜ ๋ผ๊ทธ๋ž‘์ฃผ ์Šน์ˆ˜๋กœ ์ œ์•ฝ์‹์„ ๋ฐ˜์˜ํ•˜๋Š” ๋ถ€๋ถ„์ด๋‹ค.
์ผ๋‹จ, ฮฑi\alpha_i๋ฅผ ๋ฌด์‹œํ•˜๊ณ  ๋‘ ์ตœ์ ํ™” ๋ณ€์ˆ˜์ธ w\bf w์™€ bb ์— ๋Œ€ํ•ด์„œ๋ฉด 1๊ณ„ ์กฐ๊ฑด์„ ํ’€๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

KaTeX parse error: Expected '}', got '&' at position 89: โ€ฆmbol \alpha}) =&ฬฒ \mathbf{w} - \โ€ฆ

์ด ๋…€์„๋“ค์„ ๋‹ค์‹œ ๋ผ๊ทธ๋ž‘์ฃผ ๋ฐฉ์ •์‹์— ๋„ฃ์œผ๋กœ๋งŒ ฮฑ\boldsymbol \alpha๋กœ๋งŒ ๋œ ์ผ์ข…์˜ maximal function ํ˜น์€ ๋ผ๊ทธ๋ž‘์ฃผ ๋ฐฉ์ •์‹์˜ ํ•˜ํ•œ(infimum)์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค. Wolfe duality์— ๋”ฐ๋ฅด๋จผ, ์ตœ์†Œํ™” ์ตœ๋Œ€ํ™”๋ฅผ ํ•œ ๋ฒˆ์— ํ‘ธ๋Š” ๊ฒƒ๊ณผ ํ•œ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ๋จผ์ € ํ‘ผ ํ›„ ํ•ด๋‹น ๊ฒฐ๊ณผ๋ฅผ ๋ชฉ์  ํ•จ์ˆ˜์— ๋„ฃ๊ณ  ๋‘ ๋ฒˆ์งธ ๋ฌธ์ œ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ํ‘ธ๋Š” ๊ฒƒ์ด ๋™์ผํ•˜๋‹ค. ๋”ฐ๋ผ์„œ ์œ„์˜ ์ผ๊ณ„ ์กฐ๊ฑด์„ ๋ชฉ์  ํ•จ์ˆ˜์— ๋„ฃ์€ ๋ชฉ์ ํ•จ์ˆ˜๋Š” ๊ตฌํ•ด๋ณด์ž.

W(ฮฑ)=โˆ‘i=1mฮฑiโˆ’12โˆ‘i=1mโˆ‘j=1mฮฑiฮฑjyiyjxiโ‹…yj W(\boldsymbol \alpha) = \sum_{i=1}^m \alpha_i - \dfrac{1}{2}\sum_{i=1}^{m} \sum_{j=1}^m \alpha_i \alpha_j y_i y_j {\bf x}_i \cdot {\bf y}_j

์ด์ œ ๋ฌธ์ œ๋Š” ฮฑ\boldsymbol \alpha์— ๊ด€ํ•ด์„œ ๊ทน๋Œ€ํ™” ๋ฌธ์ œ๋ฅผ ํ‘ธ๋Š” ๊ฒƒ์œผ๋กœ ๋ฐ”๋€๋‹ค. ์ฆ‰,

maxโกฮฑW(ฮฑ)s.t.ฮฑiโ‰ฅ0,โˆ‘i=1mฮฑi(yi(wโ‹…xโˆ—+b)โˆ’1)=0 \max_{\boldsymbol \alpha} W( {\boldsymbol \alpha} )\hspace{1em}\text{s.t.}\hspace{1em}{\alpha_i} \geq 0, \sum_{i=1}^m \alpha_i { \left( y_i ( \mathbf{w} \cdot {\bf x}^* + b) -1 \right)} = 0

์ œ์•ฝ ๋ถ€๋ถ„์ด ๋ถ€๋“ฑ์‹์ด๋ฏ€๋กœ KKT ์กฐ๊ฑด์— ๋”ฐ๋ผ์„œ ํ’€๋ฉด ๋œ๋‹ค.

ฮฑi[yi(wโ‹…xโˆ—+b)โˆ’1]=0 \alpha_i \left[ y_i (\mathbf{w} \cdot {\bf x}^* + b) -1 \right] = 0

KKT ์กฐ๊ฑด์ด๋ž€ ๋ถ€๋“ฑ์‹ ์ œ์•ฝ์„ ํ‘ธ๋Š” ํ…Œํฌ๋‹‰์ด๋‹ค. ์ฆ‰, ฮฑi>0\alpha_i >0์˜ ์ œ์•ฝ์ด ์œ ํšจํ•˜๋‹ค๋ฉด ์ œ์•ฝ์„ ๋งŒ์กฑ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ๋Š” yi(wโ‹…xโˆ—+b)โˆ’1=0y_i (\mathbf{w} \cdot {\bf x}^* + b) -1 = 0์ด ๋งŒ์กฑํ•ด์•ผ ํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ์ œ์•ฝ์ด ๊ฑธ๋ฆฌ๋Š” ๊ฒฝ์šฐ์— ์œ„์น˜ํ•œ xโˆ—x^*๊ฐ€ ๋ฐ”๋กœ '์„œํฌํŠธ ๋ฒกํ„ฐโ€™๋‹ค. ๋ฐ˜๋ฉด, ฮฑi=0\alpha_i =0๋Š” ์ œ์•ฝ์ด ๋“ฑํ˜ธ๋กœ ๊ฑธ๋ฆด ํ•„์š”๊ฐ€ ์—†๋Š” ํŠธ๋ ˆ์ด๋‹ ์…‹์˜ ๊ด€์ฐฐ๋“ค์ด๋‹ค. ์ด๋“ค์€ ๋ถ„๋ฅ˜ ํ•˜์ดํผํ”Œ๋ ˆ์ธ๊นŒ์ง€์˜ ๊ธธ์ด๊ฐ€ ์„œํฌํŠธ ๋ฒกํ„ฐ์˜ ๊ธธ์ด๋ณด๋‹ค ํฌ๋‹ค.

Compute w\bf w and bb

w\bf w ์˜ ๊ฒฝ์šฐ 1๊ณ„ ์กฐ๊ฑด์—์„œ ์‰ฝ๊ฒŒ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.

wโˆ’โˆ‘i=1mฮฑiyixi=0 \mathbf{w} - \sum_{i=1}^m \alpha_i y_i {\bf x} _i = 0

ํ•œํŽธ, bb์˜ ๊ฒฝ์šฐ ์„œํฌํŠธ ๋ฒกํ„ฐ์˜ ๊ฒฝ์šฐ ์œ„์—์„œ ๋ณธ ๊ฒƒ ๊ฐ™์ด ์ œ์•ฝ ์‹์˜ ๋“ฑํ˜ธ๊ฐ€ ์„ฑ๋ฆฝํ•œ๋‹ค. ์ฆ‰, ์„œํฌํŠธ ๋ฒกํ„ฐ๋ฅผ xโˆ—x^*๋ผ๊ณ  ํ•  ๋•Œ,

yi(wโ‹…xโˆ—+b)โˆ’1=0 y_i (\mathbf{w} \cdot {\bf x}^* + b) -1 = 0

b=yiโˆ’wโ‹…xโˆ— b = y_i - \mathbf{w} \cdot {\bf x}^*

b=1Sโˆ‘i=1S(yiโˆ’wโ‹…xiโˆ—) b = \dfrac{1}{S} \sum_{i=1}^S \left( y_i - \mathbf{w} \cdot {\bf x}^*_i \right)

Limitation

์•ž์„œ ๋ณธ ๊ฒƒ์„ ์ด๋ฅธ๋ฐ” hard margin SVM์ด๋‹ค. ์ฆ‰, ์„œํฌํŠธ ๋ฒกํ„ฐ ์‚ฌ์ด์˜ ๋งˆ์ง„์„ ํš์ผ์ ์œผ๋กœ ์ ์šฉํ•˜๋Š” ๋ถ„๋ฅ˜ ์•Œ๊ณ ๋ฆฌ๋“ฌ์ด๋‹ค. ํ•˜๋“œ ๋งˆ์ง„ SVM์€ ๋‹ค์Œ์˜ ๋‘๊ฐ€์ง€ ๊ฒฝ์šฐ์— ์ทจ์•ฝํ•˜๋‹ค.

๋ฐ์ดํ„ฐ์— ๋…ธ์ด์ฆˆ๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ

ํ•˜๋“œ ๋งˆ์ง„์˜ ํฐ ๋ฌธ์ œ๋Š” ์•„์›ƒ๋ผ์ด์–ด์— ์ทจ์•ฝํ•  ์ˆ˜ ๋ฐ–์— ์—†๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ํ’€์–ด์„œ ๋งํ•˜๋ฉด ์ œ์•ฝ์ด ์„ ํ˜•์ด๊ณ  ๊ทธ ์ œ์•ฝ์ด ๊ฐ•๋ ฅํ•˜๋‹ค๋Š” ๋ฐ ์žˆ๋‹ค. ํŠธ๋ ˆ์ด๋‹ ๋ฐ์ดํ„ฐ์— ์ด๋Ÿฐ ์ €๋Ÿฐ ๋…ธ์ด์ฆˆ๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ ํ•˜๋“œ ๋งˆ์ง„์€ ์•„์˜ˆ ๊ณ„์‚ฐ์ด ๋ถˆ๊ฐ€๋Šฅํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋•Œ ์‚ฌ์šฉํ•˜๋Š” ๊ธฐ๋ฒ•์ด soft margin SVM์ด๋‹ค.

๋ฐ์ดํ„ฐ ์ž์ฒด๊ฐ€ ์„ ํ˜•์œผ๋กœ ๋ถ„๋ฆฌ๋˜์ง€ ์•Š์„ ๊ฒฝ์šฐ

์• ์ดˆ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ ์ž์ฒด๊ฐ€ ์„ ํ˜•์„ ํ†ตํ•œ ๋ถ„๋ฅ˜๋ฅผ ํ—ˆ์šฉํ•˜์ง€ ์•Š์„ ๊ฒฝ์šฐ์—๋Š” ๋น„์„ ํ˜• ๋ถ„๋ฅ˜๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋•Œ ๋™์›ํ•˜๋Š” ํ…Œํฌ๋‹‰์ด ์ปค๋„ ํŠธ๋ฆญ (kernel trick)์ด๋‹ค.

Soft Margin SVM

์ œ์•ฝ์„ ์•ฝ๊ฐ„ ํ’€์–ด์ฃผ๋Š” ฮถ\zeta๋ฅผ ๋„์ž…ํ•˜์—ฌ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ์ •์‹ํ™”ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

minโกw,b,ฮถ12โˆฅwโˆฅ2+Cโˆ‘i=1mฮถi s.t yi(wโ‹…xi+b)โ‰ฅ1โˆ’ฮถi for i=1,2,โ€ฆ,m \min_{\mathbf{w}, b, {\boldsymbol \zeta}} \dfrac{1}{2} \Vert \mathbf{w} \Vert^2 + C \sum_{i=1}^m \zeta_i~\text{s.t}~ y_i ( \mathbf{w} \cdot {\bf x}_i + b) \geq 1 - \zeta_i~\text{for}~ i = 1,2,\dotsc, m

๋ฌธ์ œ๋ฅผ ํ’€๋ฉด

maxโกฮฑโˆ‘i=1mฮฑiโˆ’12โˆ‘i=1mโˆ‘j=1mฮฑiฮฑjyiyjxiโ‹…xjs.t. \max_{\boldsymbol \alpha} \sum_{i=1}^m \alpha_i - \dfrac{1}{2} \sum_{i=1}^m \sum_{j=1}^{m} \alpha_i \alpha_j {y_i} {y_j} {\bf x}_i \cdot {\bf x}_j \hspace{1em} \text{s.t.}

0โ‰คฮฑiโ‰คCfori=1,2,โ€ฆ,m 0 \leq \alpha_i \leq C\hspace{1em}\text{for}\hspace{1em}i=1,2,\dotsc, m

โˆ‘i=1mฮฑiyi=0 \sum_{i=1}^m \alpha_i y_i = 0

What is CC

์ •๊ทœํ™” ํŒŒ๋ผ๋ฏธํ„ฐ์ธ CC๋ฅผ ์–ด๋–ป๊ฒŒ ์ดํ•ดํ• ๊นŒ? ์‹ ๊ทธ๋Œ€๋กœ ๋ณด๋ฉด, ฮถ\boldsymbol \zeta๋ฅผ ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•˜๊ฒŒ ์ตœ์ ํ™” ๋ฌธ์ œ์— ๋ฐ˜์˜ํ•  ๊ฒƒ์ธ์ง€๊ฐ€ ์ค‘์š”ํ•˜๋‹ค. ๋‹ค์‹œ ๋งํ•˜๋ฉด ์ด๋Š” ์—๋Ÿฌ๋ฅผ ์–ด๋–ป๊ฒŒ ๋ณผ ๊ฒƒ์ธ๊ฐ€์™€ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ๋‹ค. ๋งŒ์ผ CC๊ฐ€ ์–‘์˜ ๋ฌดํ•œ๋Œ€ ๊ฐ’์ด๋ผ๋ฉด ์ด๋Š” ํ•˜๋“œ ๋งˆ์ง„๊ณผ ๋™์ผํ•˜๋‹ค. ๋งŒ์ผ C=0C=0 ์ด๋ผ๋ฉด ฮฑi=0\alpha_i=0๊ฐ€ ๋œ๋‹ค. ์ด๋Š” ์‚ฌ์‹ค์ƒ ์„ ํ˜• ์ œ์•ฝ์ด ์‚ฌ๋ผ์ง€๊ฒŒ ๋˜๋Š” ๊ฒฐ๊ณผ๊ฐ€ ๋œ๋‹ค. ๋”ฐ๋ผ์„œ ํ•˜์ดํผํ”Œ๋ ˆ์ธ์ด ์–ด๋–ค ๊ฒƒ๋„ ๋ถ„๋ฅ˜ํ•˜๊ธฐ ๋ชปํ•˜๊ฒŒ ๋  ๊ฒƒ์ด๋‹ค. ์ฆ‰, C๋Š” ํ•˜๋“œ ๋งˆ์ง„๊ณผ ์†Œํ”„ํŠธ ๋งˆ์ง„ ์‚ฌ์ด๋ฅผ ์ ์ ˆํ•˜๊ฒŒ ์กฐ์ ˆํ•˜๋Š” ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

Kernel Trick

๋งˆ์ง€๋ง‰์— ์–ป์€ ๊ทน๋Œ€ํ™” ๋ฌธ์ œ๋ฅผ ์‚ดํŽด๋ณด๋ฉด, training set์ด ๊ด€๋ จ๋œ ๋ถ€๋ถ„์ด ๋”ฑ ํ•˜๋‚˜ ๋ฐ–์— ์—†๋‹ค. xiโ‹…xj{\bf x}_i \cdot {\bf x}_j ๋ฟ์ด๋‹ค. ๋”ฐ๋ผ์„œ ์ด ๋ถ€๋ถ„์„ ์œ ์—ฐํ•˜๊ฒŒ ์กฐ์ •ํ•ด์คŒ์œผ๋กœ์จ ๋น„์„ ํ˜•์  ํ˜•ํƒœ์˜ ๋ถ„๋ฅ˜๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด๋‹ค. ์•ž์„œ ๋ดค๋˜ ํ•˜๋“œ ๋งˆ์ง„ SVM์€ ์„ ํ˜• ์ปค๋„์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ฆ‰,

K(xi,xj)=xiโ‹…xj K({\bf x}_i, {\bf x}_j) = {\bf x}_i \cdot {\bf x}_j

์ด๋Ÿฐ ์ปค๋„๋งŒ ์žˆ์„ ํ˜•ํƒœ๋Š” ์—†๋‹ค. ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํ•จ์ˆ˜๋ฅผ ์ปค๋„๋กœ ์ทจํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค. ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ์ปค๋„์€ RBF(Radial Basis Function) ํ˜น์€ ๊ฐ€์šฐ์‹œ์•ˆ์ด๋ผ๊ณ  ํ•œ๋‹ค.

K(xi,xj)=expโก(โˆ’ฮณโˆฅxiโˆ’xjโˆฅ) K({\bf x}_i, {\bf x}_j) = \exp \left( -\gamma \Vert {\bf x}_i - {\bf x}_j \Vert \right)

Resource

์ด ์ž๋ฃŒ๋Š” ์•„๋ž˜ ๋‚ด์šฉ์„ ๋ฐ”ํƒ•์œผ๋กœ ๋งŒ๋“ค์–ด์กŒ์Šต๋‹ˆ๋‹ค.

๐ŸพJun Sok Huhh | ๐Ÿ lostineonomics.com


  1. ๋‚ด์ ์ด๋ผ๊ณ  ๋ฒˆ์—ญ๋˜๊ธฐ๋„ ํ•˜์ง€๋งŒ ์—ฌ๊ธฐ์„œ๋Š” ๊ทธ๋ƒฅ '๋‹ท ํ”„๋กœ๋•ํŠธโ€™๋ผ๊ณ  ์“ฐ๋ฆฌ๊ณ  ํ•˜๊ฒ ๋‹ค. โ†ฉ๏ธŽ

  2. ๋ฒกํ„ฐ์˜ ๋ฐฉํ–ฅ์— ๋Œ€ํ•ด์„œ ์•ฝ๊ฐ„ ๊ฐธ์šฐ๋šฑํ•˜๋Š” ๋ถ„๋“ค์ด ์žˆ์„์ง€ ๋ชจ๋ฅด๊ฒ ๋‹ค. cosโกฮธ\cos \theta๋ฅผ ์ œ๋Œ€๋กœ ์ •์˜ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” โˆ’(xsvrโˆ’xsvl)-({\bf x}_{\rm svr} - {\bf x}_{\rm svl}), โˆ’w- \mathbf{w}๋ผ๊ณ  ์“ฐ๋Š” ๊ฒƒ์ด ๋งž์„ ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ, ๋‘˜์˜ ๋‹ท ํ”„๋กœ๋•ํŠธ๋ฅผ ๊ตฌํ•˜๋ฉด ์„œ๋กœ ์ƒ์‡„๋˜์–ด ์•„๋ž˜ ์ ์€ ๊ฒƒ๊ณผ ๋™์ผํ•˜๋‹ค. โ†ฉ๏ธŽ