[NLP] Sentence-BERT


🔷 BERT

BERT = Bidirectional Encoder Representations from Transformers
An AI language model designed to understand sentences the way a person does.

  • Earlier models read text in a single direction only.

  • BERT reads each sentence in both directions, which gives it a deeper grasp of context.

🔷 How BERT works

BERT is built on the Transformer deep-learning architecture.

🔷 Training example (masked language modeling)

  1. Mask a few words in a sentence:
    "I drank a [MASK]."

  2. The model predicts the word behind the mask:
    "coffee"

  3. This fill-in-the-blank training is repeated over tens of millions of sentences (a sketch follows this list).
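As a hedged illustration of step 2, Hugging Face's fill-mask pipeline runs exactly this kind of masked-word prediction; the bert-base-uncased checkpoint and the example sentence are assumptions for illustration, not from the post.

from transformers import pipeline

# Masked language modeling: BERT fills in the [MASK] token.
# bert-base-uncased is an assumed checkpoint for illustration.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("I drank a [MASK] this morning.")[:3]:
    # Each prediction carries the candidate token and its probability.
    print(pred["token_str"], round(pred["score"], 3))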

🔷 BERT's strengths

| Capability | Example |
| --- | --- |
| Sentence understanding | "a room near a subway station" ↔ "a studio in a station area" → it can tell these mean the same thing |
| Question answering | "What is the capital of South Korea?" → "Seoul" (sketch below) |
| Relations between sentences | "He ate a meal. He was hungry." → understands the logical connection |
| Summarization | long text → can distill the key points (requires fine-tuning) |
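To make the question-answering row concrete, here is a minimal sketch with Hugging Face's question-answering pipeline; the checkpoint and the context passage are assumptions for illustration:

from transformers import pipeline

# Extractive QA: a BERT-family model fine-tuned on SQuAD picks the
# answer span out of a supplied context passage.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

result = qa(question="What is the capital of South Korea?",
            context="Seoul is the capital and largest city of South Korea.")
print(result["answer"])  # expected: "Seoul"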


🔷 SBERT

Sentence-BERT:
a model built specifically to compare sentences by meaning.

✅ Core idea

Vanilla BERT
- learns the relationship between two sentences fed in together
- is not well suited to turning a single sentence into a vector
-> SBERT
- one sentence -> a fixed-length semantic vector (an embedding)
- sentences A and B -> comparable by vector distance (e.g., cosine similarity); a pooling sketch follows
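Where does that fixed-length vector come from? A minimal sketch, assuming the common mean-pooling recipe (plain BERT plus an average over non-padding tokens); the checkpoint is an assumption:

import torch
from transformers import AutoTokenizer, AutoModel

# Plain BERT emits one vector per token; mean pooling over the real
# (non-padding) tokens yields one fixed-length sentence vector.
name = "bert-base-multilingual-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)

inputs = tokenizer(["역세권 조용한 원룸"], return_tensors="pt", padding=True)
with torch.no_grad():
    token_vecs = bert(**inputs).last_hidden_state      # (batch, tokens, 768)

mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, tokens, 1)
sentence_vec = (token_vecs * mask).sum(1) / mask.sum(1)
print(sentence_vec.shape)  # torch.Size([1, 768])

SBERT fine-tunes this setup as a siamese network (e.g., on NLI pairs) so that distances between the pooled vectors track semantic similarity; the model.encode() calls below wrap all of this.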

🔷 Example code

🛠 Installation

pip install sentence-transformers

📦 Loading the model

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

> ✅ This model supports many languages, Korean included, and is fast and lightweight.

✍ Sentence vectorization (embedding)

sentence = "강남에 있는 월세 1000 이하 원룸 추천해줘"  # "recommend a studio in Gangnam, rent under 1000"
embedding = model.encode(sentence)
print(embedding.shape)  # (384,): the embedding dimension

  • Output: a vector (behaves like a list / NumPy array of floats)

  • Comparing these vectors' similarity compares the sentences' meaning

🔷 Computing cosine similarity

from sklearn.metrics.pairwise import cosine_similarity

s1 = "역세권 조용한 원룸"        # quiet studio near a station
s2 = "지하철 근처 조용한 방"     # quiet room near the subway
s3 = "복층 오피스텔, 넓은 거실"  # duplex officetel, spacious living room

emb1 = model.encode(s1)
emb2 = model.encode(s2)
emb3 = model.encode(s3)

print("์œ ์‚ฌ๋„ (1 vs 2):", cosine_similarity([emb1], [emb2])[0][0])
print("์œ ์‚ฌ๋„ (1 vs 3):", cosine_similarity([emb1], [emb3])[0][0])

Expected result:

  • 1 vs 2: around 0.85, similar meaning

  • 1 vs 3: around 0.40, less similar

🔷 Overall flow

  1. Load the SBERT model: SentenceTransformer

  2. Encode the user's sentence: model.encode(user_input)

  3. Encode the listing descriptions or keyword sentences: model.encode(listing_texts)

  4. Compute cosine similarity between the vectors

  5. Recommend the listings with the highest similarity (see the sketch after this list)
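Putting the five steps together, a minimal end-to-end sketch; the listing texts are hypothetical stand-ins for whatever the DB would return:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer(
    'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Hypothetical listing descriptions; in practice these come from the DB.
listing_texts = [
    "역세권 조용한 원룸",        # quiet studio near a station
    "지하철 근처 조용한 방",     # quiet room near the subway
    "복층 오피스텔, 넓은 거실",  # duplex officetel, spacious living room
]
user_input = "강남에 있는 월세 1000 이하 원룸 추천해줘"

user_vec = model.encode([user_input])       # shape (1, 384)
listing_vecs = model.encode(listing_texts)  # shape (3, 384)

# Rank listings by cosine similarity, highest first.
scores = cosine_similarity(user_vec, listing_vecs)[0]
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {listing_texts[idx]}")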

🔷 SBERT models worth considering for the recommender

| Model name | Characteristics |
| --- | --- |
| paraphrase-multilingual-MiniLM-L12-v2 | multilingual (Korean included), fast |
| distiluse-base-multilingual-cased-v1 | handles Korean well, slightly more accurate |
| jhgan/ko-sbert-nli | Korean-specific (tuned for question/sentence similarity) |

Notes

  • ✅ Converting the embeddings to NumPy arrays and storing them ahead of time keeps DB lookups fast, since nothing has to be re-encoded at query time

  • ✅ Passing normalize_embeddings=True to model.encode() returns unit-length vectors, so a plain dot product already equals cosine similarity

  • ✅ Several sentences can be vectorized in a single call (batch encoding); see the sketch below
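A small sketch combining the last two notes, assuming the normalize_embeddings flag of sentence-transformers' encode():

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Batch encoding: pass a list, get one row per sentence.
# normalize_embeddings=True returns unit-length vectors, so a plain
# dot product equals cosine similarity.
embs = model.encode(
    ["역세권 조용한 원룸", "지하철 근처 조용한 방"],
    normalize_embeddings=True,
)
print(embs.shape)                       # (2, 384)
print(float(np.dot(embs[0], embs[1])))  # cosine similarity of the pair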