Writing Hangul Obfuscation Code

In this post I try writing code that obfuscates Hangul (Korean) text.

Purpose: data augmentation to improve model performance

Hangul Obfuscation

Converts text into sentences that machine translators cannot read but Koreans can.

https://airbnbfy.hanmesoft.com/

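I set up the rules based on the conversion options offered by the site above, a well-known Hangul obfuscation tool.

The code is written according to the following rules.

Apply liaison (write words as they are pronounced)
Duplicate the following consonant as a final consonant (batchim)
Replace jamo with similar-sounding jamo
Add meaningless final consonants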
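The snippets in this post call split_syllable and combine_syllable and index into CHOSUNG_LIST, JUNGSUNG_LIST, and JONGSUNG_LIST, but the post itself does not show them. Here is a minimal sketch of what they could look like, assuming the standard arithmetic layout of the Unicode Hangul Syllables block (U+AC00 to U+D7A3); the actual definitions in the notebook may differ.

```python
# Assumed helper definitions (not shown in the post itself).
CHOSUNG_LIST = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")        # 19 initials
JUNGSUNG_LIST = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")    # 21 vowels
JONGSUNG_LIST = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

HANGUL_BASE = 0xAC00  # code point of '가'


def split_syllable(char):
    """Return the (cho, jung, jong) indices of a precomposed Hangul syllable."""
    offset = ord(char) - HANGUL_BASE
    cho, rest = divmod(offset, 21 * 28)
    jung, jong = divmod(rest, 28)
    return cho, jung, jong


def combine_syllable(cho, jung, jong):
    """Build a precomposed syllable from its three indices."""
    return chr(HANGUL_BASE + (cho * 21 + jung) * 28 + jong)


print(split_syllable("책"))        # (14, 1, 1): ㅊ, ㅐ, ㄱ
print(combine_syllable(14, 1, 0))  # 채
```

Index 0 in JONGSUNG_LIST stands for "no final", which is why every function below treats jong == 0 as an open syllable.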

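Apply liaison (write words as they are pronounced)

What is liaison (연음법칙)? It is the phonological phenomenon in which, when a syllable ending in a consonant is followed by a morpheme beginning with a vowel, the final consonant of the first syllable becomes the initial sound of the following syllable.

Examples of liaison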

책 + 이 → 채기
옷 + 을 → 오슬
낮 + 에 → 나제

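Characteristics of liaison

A single final consonant moves to the initial position of the next syllable and keeps its own sound value.
For the compound finals ㄵ, ㄺ, ㄻ, ㄾ, and ㄿ, the second consonant of the pair is the one that moves.
Words ending in 'ㅇ', such as 강 and 방, do not undergo liaison.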

# Apply liaison
# Indices of compound finals (ㄳ, ㄵ, ㄶ, ㄺ, ㄻ, ㅄ, etc.) in JONGSUNG_LIST
COMPLEX_FINALS = {3, 5, 6, 9, 10, 11, 12, 13, 14, 15, 18}
IEUNG_CHO = 11   # index of 'ㅇ' in CHOSUNG_LIST
IEUNG_JONG = 21  # index of 'ㅇ' in JONGSUNG_LIST

def apply_liaison(word):
    result = []
    i = 0

    while i < len(word):
        current_char = word[i]

        # Liaison needs a Hangul syllable followed by another Hangul syllable
        if (i < len(word) - 1 and "가" <= current_char <= "힣"
                and "가" <= word[i + 1] <= "힣"):
            cho, jung, jong = split_syllable(current_char)
            next_cho, next_jung, next_jong = split_syllable(word[i + 1])

            # A single final other than 'ㅇ', followed by a syllable that starts with 'ㅇ'
            if (jong != 0 and jong != IEUNG_JONG and jong not in COMPLEX_FINALS
                    and next_cho == IEUNG_CHO):
                # Move the final to the initial position of the next syllable
                new_cho = CHOSUNG_LIST.index(JONGSUNG_LIST[jong])

                # Drop the final from the current syllable
                result.append(combine_syllable(cho, jung, 0))
                # Replace the initial of the next syllable
                result.append(combine_syllable(new_cho, next_jung, next_jong))
                i += 2  # both syllables are handled, so skip two positions
                continue

        result.append(current_char)
        i += 1

    return ''.join(result)
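apply_liaison above skips compound finals entirely. The rule described earlier (for ㄵ, ㄺ, ㄻ, ㄾ, ㄿ the second consonant moves) could be sketched as follows. The names COMPOUND_SPLIT and liaise_compound are hypothetical, the table covers only the five finals the post lists, and the small split/join helpers are repeated so the block runs on its own.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Hypothetical table: compound final -> (final that stays, consonant that moves)
COMPOUND_SPLIT = {
    "ㄵ": ("ㄴ", "ㅈ"),
    "ㄺ": ("ㄹ", "ㄱ"),
    "ㄻ": ("ㄹ", "ㅁ"),
    "ㄾ": ("ㄹ", "ㅌ"),
    "ㄿ": ("ㄹ", "ㅍ"),
}


def split(ch):
    off = ord(ch) - 0xAC00
    return off // 588, (off % 588) // 28, off % 28


def join(cho, jung, jong):
    return chr(0xAC00 + (cho * 21 + jung) * 28 + jong)


def is_hangul(ch):
    return "가" <= ch <= "힣"


def liaise_compound(word):
    chars = list(word)
    for i in range(len(chars) - 1):
        cur, nxt = chars[i], chars[i + 1]
        if not (is_hangul(cur) and is_hangul(nxt)):
            continue
        cho, jung, jong = split(cur)
        n_cho, n_jung, n_jong = split(nxt)
        # Only act when the next syllable starts with the silent 'ㅇ'
        if CHO[n_cho] == "ㅇ" and JONG[jong] in COMPOUND_SPLIT:
            stay, move = COMPOUND_SPLIT[JONG[jong]]
            chars[i] = join(cho, jung, JONG.index(stay))
            chars[i + 1] = join(CHO.index(move), n_jung, n_jong)
    return "".join(chars)


print(liaise_compound("앉아"))  # 안자
print(liaise_compound("읽어"))  # 일거
print(liaise_compound("삶은"))  # 살믄
```

Single finals are left alone here, so a sketch like this would run alongside apply_liaison rather than replace it.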

Duplicate the following consonant as a final consonant

후기를 → 후길를
지구상 → 지굿상
어떤 → 얻떤

# Mapping from an initial-consonant index to the matching final-consonant index.
# ㄸ(4), ㅂ(7), ㅃ(8), ㅉ(13) and ㅎ(18) have no entry, so a syllable followed by one of them is left unchanged.
CHO_TO_JONG = {
    0: 1, 1: 2, 2: 4, 3: 7, 5: 8, 6: 16, 9: 19, 10: 20, 11: 21, 12: 22, 14: 23, 15: 24, 16: 25, 17: 26
}

# Duplicate the following consonant as a final consonant
def cho_to_jong(word):
    result = []

    for i, syllable in enumerate(word):
        if "가" <= syllable <= "힣" and i < len(word) - 1:
            next_syllable = word[i + 1]
            cho, jung, jong = split_syllable(syllable)

            # Only an open syllable followed by another Hangul syllable is changed
            if jong == 0 and "가" <= next_syllable <= "힣":
                next_cho, _, _ = split_syllable(next_syllable)

                # Copy the next syllable's initial into this syllable's final slot
                if next_cho in CHO_TO_JONG:
                    syllable = combine_syllable(cho, jung, CHO_TO_JONG[next_cho])

        result.append(syllable)

    return ''.join(result)
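One of the examples above, 어떤 → 얻떤, is not reproduced by cho_to_jong as written, because CHO_TO_JONG has no entry for ㄸ. A possible variant looks the final up by jamo character instead of by index and lets tense initials fall back to their plain counterpart. This is only a sketch: FALLBACK, final_index_for, and double_next_initial are hypothetical names, and the fallback rule itself is my assumption, not something the site documents.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Assumption: tense initials that cannot be finals fall back to the plain consonant
FALLBACK = {"ㄸ": "ㄷ", "ㅃ": "ㅂ", "ㅉ": "ㅈ"}


def final_index_for(initial):
    """Index in JONG of the final that corresponds to an initial jamo."""
    return JONG.index(FALLBACK.get(initial, initial))


def is_hangul(ch):
    return "가" <= ch <= "힣"


def double_next_initial(word):
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if is_hangul(ch) and nxt and is_hangul(nxt):
            has_final = (ord(ch) - 0xAC00) % 28 != 0
            if not has_final:
                next_initial = CHO[(ord(nxt) - 0xAC00) // 588]
                # Finals occupy consecutive code points after the open syllable
                ch = chr(ord(ch) + final_index_for(next_initial))
        out.append(ch)
    return "".join(out)


print(double_next_initial("어떤"))    # 얻떤
print(double_next_initial("지구상"))  # 직굿상
```

Note that every open syllable is changed, so 지구상 becomes 직굿상 rather than the 지굿상 shown in the example list; applying the rule to only some syllables would need an extra random gate.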

Replace jamo with similar-sounding jamo

Consonant substitutions

ㅂ → ㅃ (방 → 빵)
ㅅ → ㅆ (숙박 → 쑥박)
ㄷ → ㅌ (된다 → 퇸타)
ㄱ → ㅋ (규칙 → 큐칙)

Vowel substitutions

ㅏ → ㅑ
ㅓ → ㅔ
ㅚ → ㅙ, ㅢ
ㅟ → ㅞ, ㅢ
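transform_hangul reads its substitutions from three lookup tables, CHOSUNG_MAP, JUNGSUNG_MAP, and JONGSUNG_MAP, which the post does not show. A minimal sketch built only from the rules listed above might look like this; the real tables in the notebook are probably larger, and leaving JONGSUNG_MAP empty is purely my assumption since no final substitutions are listed. The candidates helper is hypothetical and just enumerates what the tables can produce.

```python
import random
from itertools import product

CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

# Each key maps to the list of jamo it may be replaced with (assumed contents)
CHOSUNG_MAP = {"ㅂ": ["ㅃ"], "ㅅ": ["ㅆ"], "ㄷ": ["ㅌ"], "ㄱ": ["ㅋ"]}
JUNGSUNG_MAP = {"ㅏ": ["ㅑ"], "ㅓ": ["ㅔ"], "ㅚ": ["ㅙ", "ㅢ"], "ㅟ": ["ㅞ", "ㅢ"]}
JONGSUNG_MAP = {}  # assumption: no substitutions for finals


def candidates(ch):
    """Every syllable the tables above can turn `ch` into."""
    off = ord(ch) - 0xAC00
    cho, jung, jong = CHO[off // 588], JUNG[(off % 588) // 28], JONG[off % 28]
    options = (
        CHOSUNG_MAP.get(cho, [cho]),
        JUNGSUNG_MAP.get(jung, [jung]),
        JONGSUNG_MAP.get(jong, [jong]),
    )
    return {
        chr(0xAC00 + (CHO.index(c) * 21 + JUNG.index(v)) * 28 + JONG.index(f))
        for c, v, f in product(*options)
    }


print(candidates("방"))                              # {'뺭'}
print(random.choice(sorted(candidates("쉬"))))       # 쒜 or 씌
```

Because a mapped jamo is always replaced, 방 comes out as 뺭 with these tables (both ㅂ and ㅏ are mapped), not as the consonant-only 빵 in the example; adding the original jamo to its own list would make a substitution optional.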

# Replace jamo with similar-sounding jamo
def transform_hangul(text):
    result = []

    for char in text:
        if "가" <= char <= "힣":
            cho, jung, jong = split_syllable(char)

            # Initial: pick a random substitute if one is defined
            chosung_char = CHOSUNG_LIST[cho]
            if chosung_char in CHOSUNG_MAP:
                chosung_char = random.choice(CHOSUNG_MAP[chosung_char])
            new_cho = CHOSUNG_LIST.index(chosung_char)

            # Vowel: pick a random substitute if one is defined
            jungsung_char = JUNGSUNG_LIST[jung]
            if jungsung_char in JUNGSUNG_MAP:
                jungsung_char = random.choice(JUNGSUNG_MAP[jungsung_char])
            new_jung = JUNGSUNG_LIST.index(jungsung_char)

            # Final: pick a random substitute if one is defined
            jongsung_char = JONGSUNG_LIST[jong]
            if jongsung_char in JONGSUNG_MAP:
                jongsung_char = random.choice(JONGSUNG_MAP[jongsung_char])
            new_jong = JONGSUNG_LIST.index(jongsung_char)

            result.append(combine_syllable(new_cho, new_jung, new_jong))
        else:
            result.append(char)  # keep non-Hangul characters as they are

    return ''.join(result)

Add meaningless final consonants

As the name suggests, a final consonant is added to any syllable that does not have one.

해외여행 → 햇욍영행
모르게 → 못륵겍
만들어내는 → 만들얶냈는

📌 Unicode index range of Hangul finals (batchim)

Hangul has 28 finals in total, indexed 0 through 27 (index 0 means the syllable has no final).

["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]
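Since the 28 finals of a given initial and vowel sit at consecutive code points, attaching a final to an open syllable amounts to adding its index to the code point. A small sketch of that idea, with add_final as a hypothetical helper name:

```python
JONGSUNG_LIST = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def add_final(syllable, final):
    """Attach `final` to an open Hangul syllable; return anything else unchanged."""
    offset = ord(syllable) - 0xAC00
    is_hangul = 0 <= offset <= 0xD7A3 - 0xAC00
    if not is_hangul or offset % 28 != 0:  # not Hangul, or it already has a final
        return syllable
    return chr(ord(syllable) + JONGSUNG_LIST.index(final))


print(add_final("해", "ㅅ"))  # 햇
print(add_final("모", "ㅅ"))  # 못
print(add_final("강", "ㄱ"))  # 강 (already has a final)
```

The first two results match the 햇 and 못 in the examples above.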

# Add a meaningless final consonant
def add_random_jongseong(word):
    result = []

    for char in word:
        if "가" <= char <= "힣":
            cho, jung, jong = split_syllable(char)

            if jong == 0:  # the syllable has no final
                # Pick an index from 1 to 27 so a final is always added (index 0 means "no final")
                new_jong = random.randrange(1, len(JONGSUNG_LIST))
                char = combine_syllable(cho, jung, new_jong)

        result.append(char)

    return ''.join(result)

Full code 🤖

import random

def obfuscate_korean(text, settings):
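    methods_all = [
        ("apply_liaison", apply_liaison),  # apply liaison
        ("cho_to_jong", cho_to_jong),  # duplicate the next initial as a final
        ("transform_hangul", transform_hangul),  # substitute similar-sounding jamo
        ("add_random_jongseong", add_random_jongseong),  # add a random final
    ]

    # Single-syllable words have no neighbour, so only the per-syllable methods apply
    methods_short = [
        ("transform_hangul", transform_hangul),  # substitute similar-sounding jamo
        ("add_random_jongseong", add_random_jongseong),  # add a random final
    ]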

    obfuscated_words = []

    for word in text.split():
        methods = methods_short if len(word) == 1 else methods_all

        for name, method in methods:
            if random.random() < settings.get(name, 0):  # apply each method with its configured probability
                word = method(word)

        obfuscated_words.append(word)

    return ' '.join(obfuscated_words)

# Settings (probability of applying each obfuscation method)
settings = {
    "transform_hangul": 0.8,
    "add_random_jongseong": 0.7,
    "apply_liaison": 0.5,
    "cho_to_jong": 0.6
}

# test
input_text = "가격 대비 깨끗한 침구류와 친절하신 사장님 덕분에 잘 쉬다 갑니다."
output = obfuscate_korean(input_text, settings)
print("Obfuscated sentence:", output)

# Sample output (it differs on every run because the methods are randomized)
Obfuscated sentence: 꺜켞 덆삅 캚끳햖 침굴륭왁 친쪓핬쒾 샺쨯뉨 덕분넹 쨜 쉳탸 갑닍다.

Google Colab

https://colab.research.google.com/drive/1NFBniV2DzBOByQuUxDL3bUaPHw_pZNIE?usp=sharing