fuzzy검색

Project/fullmemo

fuzzy검색

fullfish 2023. 12. 12. 22:38

Regular Expression을 이용한 검색

이전에 만들었던

퍼지(fuzzy) 검색 (정규표현식이용)

fuzzy logic : 불분명한 상태, 모호한 상태를 참 혹은 거짓의 이진 논리에서 벗어난 다치성으로 표현하는 논리개념 (위키백과) 확률론과 근본적으로 다른것이 부엌과 침실사이에 서 있을때 50%확률

fullfish.tistory.com

fuzzy검색의 highlight와 가중치 적용

참고 : https://taegon.kim/archives/9919 [JS] 한글도 지원하는 퍼지 문자열 검색 UI 작업을 하다보면 목록을 검색해야 할 때가 많다. 그런데 사람의 기억이라는 게 정확하지 않아서 혹은 전부 입력하기 귀

fullfish.tistory.com

이 부분들을 현재 프로젝트에 맞게 수정 및 업그레이드 했다

변경점

1. 원래는 제목이나 내용 하나만 검색이 가능했는데 둘중 하나 및 둘다 동시 검색이 가능하게끔 바꾸었으며 동시 검색시에 얼마나 더 정확한지는 제목과 내용 각각의 거리중 적은 값을 사용함

2. 원래는 함수를 각각써서 검색값에 맞는 요소들 filter후에 거리에 따른 sort를하고 highlight를 줬는데 한번에 하도록 바꿈 그래서 1번 순회로 가능해서 효율성이 좋아짐

코드

function highlightText(
  text: string,
  regex: RegExp,
  searchStr: string,
  totalDistance: number,
) {
  let isMatched = false;
  let highlightedText = text.replace(regex, (match, ...groups) => {
    isMatched = true;
    const letters = groups.slice(0, searchStr.length);
    let lastIndex = 0;
    let redColor = [];

    for (let i = 0, l = letters.length; i < l; i++) {
      const idx = match.indexOf(letters[i], lastIndex);
      redColor.push(match.substring(lastIndex, idx));
      redColor.push(`|highlight|${letters[i]}`);
      if (lastIndex > 0) {
        totalDistance += idx - lastIndex;
      }
      lastIndex = idx + 1;
    }
    return redColor.join('');
  });

  return {text: highlightedText, distance: totalDistance, matched: isMatched};
}

export const filterAndSortByTotalDistance = (
  dataList: Imemo[],
  searchStr: string,
  searchScope: 'title' | 'content' | 'all',
) => {
  if (searchStr === undefined) return dataList;
  const regex = createFuzzyMatcher(searchStr);

  return dataList
    .reduce<IsearchMemo[]>((filteredAndSorted, ele) => {
      let totalDistance = 0;
      let isMatched = false;
      let highlightedTitle = ele.title;
      let highlightedContent = ele.content;

      // Title 검색
      if (searchScope === 'title' || searchScope === 'all') {
        let titleResult = highlightText(
          ele.title,
          regex,
          searchStr,
          totalDistance,
        );
        highlightedTitle = titleResult.text;
        totalDistance = titleResult.distance;
        isMatched = titleResult.matched || isMatched;
      }

      // Content 검색
      if (searchScope === 'content' || searchScope === 'all') {
        let contentResult = highlightText(
          ele.content,
          regex,
          searchStr,
          totalDistance,
        );
        highlightedContent = contentResult.text;
        totalDistance = Math.min(totalDistance, contentResult.distance);
        isMatched = contentResult.matched || isMatched;
      }

      if (isMatched) {
        filteredAndSorted.push({
          ...ele,
          highlightedTitle,
          highlightedContent,
          totalDistance,
        });
      }

      return filteredAndSorted;
    }, [])
    .sort((a, b) => a.totalDistance - b.totalDistance);
};

그리고

랜더 해주는쪽

<CustomHighlightedText
  style={titleStyle(width)}
  text={memoItem?.highlightedContent}
  searchText={searchText}
  highlightStyle={{color: 'red'}}
  numberOfLines={1}
/>

// CustomHighlightedText 컴포
import React from 'react';
import {Text} from 'react-native';
import CustomText from './CustomText';
import {createFuzzyMatcher} from '~/utils/fuzzy/regularExpression';

type ModeType = 'content' | 'title' | 'explanation';
type HighlightInfo = {
  s: string;
  highlight: boolean;
};
interface IProps {
  text: string;
  style?: any;
  searchText: string;
  numberOfLines?: number;
  mode?: ModeType;
  highlightStyle?: any;
}

const CustomHighlightedText = ({
  text,
  style,
  searchText,
  numberOfLines,
  mode,
  highlightStyle,
}: IProps) => {
  if (!text) return;
  let highlightInfoList: HighlightInfo[] = [];
  const parts = text.split('|highlight|');
  parts.forEach((part, index) => {
    for (let i = 0; i < part.length; i++) {
      if (i === 0 && index > 0)
        highlightInfoList.push({s: part[i], highlight: true});
      else highlightInfoList.push({s: part[i], highlight: false});
    }
  });
  // 하이라이트 적용
  return (
    <Text style={style} numberOfLines={numberOfLines} ellipsizeMode="clip">
      {highlightInfoList.map((ele, index) => {
        if (ele.highlight) {
          return (
            <CustomText
              key={index}
              text={ele.s}
              style={highlightStyle}
              mode={mode}
            />
          );
        } else {
          return <CustomText key={index} text={ele.s} mode={mode} />;
        }
      })}
    </Text>
  );
};

export default CustomHighlightedText;

이 코드들은 'abcdec'에서 'cc'를 검색시

문제점

fuzzy검색중 단일 방법만 사용한다면 이렇게 fuzzy검색 단계에서 문자열을 변경해주고 해당 문자열을 이용해서 랜더시에 하이라이팅을 해주면되는데 fuzzy검색 방법 여러개를 중첩해서 사용한다면 각각의 방법에서의 출력 문자열에 |highlight|가 들어가면 이상하고

또한 LevenshteinDistance와 N-gram의 경우 일치하는 문자가아닌 유사한 문자 및 문장이다보니 어느 글자에 하이라이팅을 하기가 애매해진다

해결법

fuzzy검색 함수 자체는 원본문자열을 return하고 랜더해주는 부분에서 검색어로 정규식검색을 해서 하이라이팅하는게 옳아 보인다

보는 입장에서도 내 검색어와 어떠한 부분이 일치한 곳인지 직관적으로 더 잘 알 수 있겠다

해결법대로한

Regular Expression을 이용한 검색

const {escapeRegExp} = require('lodash');

interface ChoArr {
  [key: string]: number;
}
export const chageUnicode = (ch: string) => {
  const offset = 44032; /* '가'의 코드 */
  // 초성 + 중성 + (종성) 완전한 문자일 경우
  if (/[가-힣]/.test(ch)) {
    const chCode = ch.charCodeAt(0) - offset;
    // 종성이 있으면 문자 그대로를 찾는다.
    if (chCode % 28 > 0) {
      return ch;
    }
    const begin = Math.floor(chCode / 28) * 28 + offset;
    const end = begin + 27;
    return `[\\u${begin.toString(16)}-\\u${end.toString(16)}]`;
  }
  // 초성만 있을 경우
  if (/[ㄱ-ㅎ]/.test(ch)) {
    const choArr: ChoArr = {
      ㄱ: '가'.charCodeAt(0),
      ㄲ: '까'.charCodeAt(0),
      ㄴ: '나'.charCodeAt(0),
      ㄷ: '다'.charCodeAt(0),
      ㄸ: '따'.charCodeAt(0),
      ㄹ: '라'.charCodeAt(0),
      ㅁ: '마'.charCodeAt(0),
      ㅂ: '바'.charCodeAt(0),
      ㅃ: '빠'.charCodeAt(0),
      ㅅ: '사'.charCodeAt(0),
    };
    const begin =
      choArr[ch] ||
      (ch.charCodeAt(0) - 12613) /* 'ㅅ'의 코드 */ * 588 + choArr['ㅅ'];
    const end = begin + 587;
    return `[${ch}\\u${begin.toString(16)}-\\u${end.toString(16)}]`;
  }
  // 그 외엔 그대로 내보냄
  return escapeRegExp(ch); // 정규식에서 의미있는 와일드카드들을 문자열로 바꿔주는거
};

// 정규식 패턴으로 변환
export const createFuzzyMatcher = (str: string) => {
  if (str === undefined) return new RegExp('.');
  const pattern = str
    .split('')
    .map(chageUnicode)
    .map(pattern => '(' + pattern + ')')
    .join('.*?');
  return new RegExp(pattern);
};

function highlightText(
  text: string,
  regex: RegExp,
  searchStr: string,
  totalDistance: number,
) {
  let isMatched = false;
  let highlightedText = text.replace(regex, (match, ...groups) => {
    isMatched = true;
    const letters = groups.slice(0, searchStr.length);
    let lastIndex = 0;
    let redColor = [];

    for (let i = 0, l = letters.length; i < l; i++) {
      const idx = match.indexOf(letters[i], lastIndex);
      redColor.push(match.substring(lastIndex, idx));
      redColor.push(`${letters[i]}`);
      if (lastIndex > 0) {
        totalDistance += idx - lastIndex;
      }
      lastIndex = idx + 1;
    }
    return redColor.join('');
  });

  return {text: highlightedText, distance: totalDistance, matched: isMatched};
}

export const filterAndSortByTotalDistance = (
  dataList: Imemo[],
  searchStr: string,
  searchScope: 'title' | 'content' | 'all',
) => {
  if (searchStr === undefined) return dataList;
  const regex = createFuzzyMatcher(searchStr);

  return dataList
    .reduce<IsearchMemo[]>((filteredAndSorted, ele) => {
      let totalDistance = 0;
      let isMatched = false;

      // Title 검색
      if (searchScope === 'title' || searchScope === 'all') {
        let titleResult = highlightText(
          ele.title,
          regex,
          searchStr,
          totalDistance,
        );
        totalDistance = titleResult.distance;
        isMatched = titleResult.matched || isMatched;
      }

      // Content 검색
      if (searchScope === 'content' || searchScope === 'all') {
        let contentResult = highlightText(
          ele.content,
          regex,
          searchStr,
          totalDistance,
        );
        totalDistance = Math.min(totalDistance, contentResult.distance);
        isMatched = contentResult.matched || isMatched;
      }

      if (isMatched) {
        filteredAndSorted.push({
          ...ele,
          totalDistance,
        });
      }

      return filteredAndSorted;
    }, [])
    .sort((a, b) => a.totalDistance - b.totalDistance);
};

// 사용 하는곳
const regExpMemoList = filterAndSortByTotalDistance(
  mainData.memoList,
  searchText,
  searchScope,
);

제목 및 내용을 선택 및 동시 검색이 되며 유사도 검색하고 유사한 정도에따라 sort를 함

LevenshteinDistance를 이용한 fuzzy 검색

레벤슈타인 거리 (Levenshtein Distance)

레벤슈타인 거리란 문자열의 유사도를 검사하는 기본적인 알고리즘으로 편집 거리라고도 부름 a문자열에서 b문자열로 편집할때 몇번의 조작이 필요한지를 도출해낸다 예를들어 '가나다라'와 '

fullfish.tistory.com

레벤슈타인 거리 시간복잡도와 공간복잡도 개선

이전 게시물 레벤슈타인 거리 (Levenshtein Distance) 레벤슈타인 거리란 문자열의 유사도를 검사하는 기본적인 알고리즘으로 편집 거리라고도 부름 a문자열에서 b문자열로 편집할때 몇번의 조작이

fullfish.tistory.com

이전에 이렇게 했었는데

공간복잡도 개선쪽에서 잘못 생각한 점이 2가지있는데

1. 이중배열에서 안쓰는 부분을 null로 넣으면 가비지콜렉터가 없애준다고 생각했는데 요소에 null값을 넣은 해당 이중배열자체를 계속 참조하고 있으므로 메모리에서 삭제를 하지않는다

2. 다음 값을 얻기위해서 이전 행과 열, 현재 행과 열이 필요하다고 했는데

이전 행, 현재 행만 필요해서 메모리자체를 획기적으로 줄일 수 있다

그래서 그 부분을 적용했으며

이것도 제목 및 내용, 모두를 선택할 수 있게 하였고 사용할때 간편하게 하도록 list 형태로 받아서 쓰게 변경했다

const levenshteinDistance = (
  str: string,
  search: string,
  maxDistance: number,
) => {
  if (search === undefined) return 0;
  if (str === search) return 0;

  let strLen = str.length;
  let searchLen = search.length;

  if (strLen === 0) return searchLen;
  if (searchLen === 0) return strLen;

  let previousRow = Array(searchLen + 1).fill(0);
  let currentRow = Array(searchLen + 1);

  // 첫 행 초기화
  for (let i = 0; i <= searchLen; i++) {
    previousRow[i] = i;
  }

  for (let i = 1; i <= strLen; i++) {
    currentRow[0] = i;

    for (let j = 1; j <= searchLen; j++) {
      let cost = str[i - 1] === search[j - 1] ? 0 : 1;
      currentRow[j] = Math.min(
        previousRow[j] + 1, // 문자 삭제
        currentRow[j - 1] + 1, // 문자 삽입
        previousRow[j - 1] + cost,
      ); // 문자 교체
    }

    // 현재 행 계산이 완료되면, 이전 행을 현재 행으로 교체하고 새로운 현재 행을 준비
    [previousRow, currentRow] = [currentRow, previousRow];

    // 타겟 거리보다 크면 중단
    if (Math.min(...previousRow) > maxDistance) break;
  }

  return previousRow[searchLen];
};
export const filterByLevenshteinDistance = (
  dataList: Imemo[],
  searchStr: string,
  maxDistance: number,
  searchScope: 'title' | 'content' | 'all',
) => {
  return dataList.filter(memo => {
    switch (searchScope) {
      case 'title':
        return (
          levenshteinDistance(memo.title, searchStr, maxDistance) <= maxDistance
        );
      case 'content':
        return (
          levenshteinDistance(memo.content, searchStr, maxDistance) <=
          maxDistance
        );
      case 'all':
        return (
          levenshteinDistance(memo.title, searchStr, maxDistance) <=
            maxDistance ||
          levenshteinDistance(memo.content, searchStr, maxDistance) <=
            maxDistance
        );
      default:
        return false;
    }
  });
};

// 시용 하는 곳
const levenshteinMemoList = filterByLevenshteinDistance(
  mainData.memoList,
  searchText,
  3,
  searchScope,
);

N-gram을 이용한 fuzzy검색

n-Gram

n-Gram이란 문장의 유사도를 비교하는 방법중 하나로 문장을 쪼개서 비교한다 예를 들어 3-gram으로 '과자중에 제일 맛있는건 새우깡' '제일 맛있는 과자는 무엇일까' 이 두문장을 비교한다면 각 문

fullfish.tistory.com

n-Gram 개선 및 고찰

자음 모음단위로 n-Gram 저번에 구현한 n-gram https://fullfish.tistory.com/109 n-Gram n-Gram이란 문장의 유사도를 비교하는 방법중 하나로 문장을 쪼개서 비교한다 예를 들어 3-gram으로 '과자중에 제일 맛있는

fullfish.tistory.com

N-gram역시 이전에 이렇게 했었는데

마찬가지로 제목, 내용 또는 한번에 검색하게 했으며

사용하기 편하게 list형식을 받아서 쓰게 만들었다

그외는 변경된 점 없음

const chSplit = (ch: string) => {
  const rCho = [
    'ㄱ',
    'ㄲ',
    'ㄴ',
    'ㄷ',
    'ㄸ',
    'ㄹ',
    'ㅁ',
    'ㅂ',
    'ㅃ',
    'ㅅ',
    'ㅆ',
    'ㅇ',
    'ㅈ',
    'ㅉ',
    'ㅊ',
    'ㅋ',
    'ㅌ',
    'ㅍ',
    'ㅎ',
  ];
  const rJung = [
    'ㅏ',
    'ㅐ',
    'ㅑ',
    'ㅒ',
    'ㅓ',
    'ㅔ',
    'ㅕ',
    'ㅖ',
    'ㅗ',
    'ㅘ',
    'ㅙ',
    'ㅚ',
    'ㅛ',
    'ㅜ',
    'ㅝ',
    'ㅞ',
    'ㅟ',
    'ㅠ',
    'ㅡ',
    'ㅢ',
    'ㅣ',
  ];
  const rJong = [
    '',
    'ㄱ',
    'ㄲ',
    'ㄳ',
    'ㄴ',
    'ㄵ',
    'ㄶ',
    'ㄷ',
    'ㄹ',
    'ㄺ',
    'ㄻ',
    'ㄼ',
    'ㄽ',
    'ㄾ',
    'ㄿ',
    'ㅀ',
    'ㅁ',
    'ㅂ',
    'ㅄ',
    'ㅅ',
    'ㅆ',
    'ㅇ',
    'ㅈ',
    'ㅊ',
    'ㅋ',
    'ㅌ',
    'ㅍ',
    'ㅎ',
  ];
  let resultStr = '';
  for (let i = 0; i < ch.length; i++) {
    //만약 ch가 자음이 아닌 한글 문자일때만 이거 해당
    if (/[가-힣]/.test(ch[i])) {
      const nTmp = ch[i].charCodeAt(0) - 0xac00;
      const jong = nTmp % 28; // 종성
      const jung = ((nTmp - jong) / 28) % 21; // 중성
      const cho = ((nTmp - jong) / 28 - jung) / 21; // 초성

      resultStr += rCho[cho] + rJung[jung];
      if (rJong[jong] !== '') resultStr += rJong[jong];
    } else resultStr += ch[i];
  }
  return resultStr;
};

const ngram = (str: string, num: number) => {
  let arr = [];
  const repeat = str.length - num + 1;
  for (let i = 0; i < repeat; i++) {
    const split = str.slice(i, i + num);
    arr.push(split);
  }
  return arr;
};

export const diff_ngram = (text: string, search: string, num: number) => {
  if (search === undefined) return 0;
  text = chSplit(text);
  search = chSplit(search);
  let splitArrA = ngram(text, num);
  let splitArrB = ngram(search, num);
  const splitArrBLength = splitArrB.length;
  let count = 0;
  for (let i = 0; i < splitArrA.length; i++) {
    for (let j = 0; j < splitArrB.length; j++) {
      if (splitArrA[i] === splitArrB[j]) {
        count++;
        splitArrB.splice(j, 1);
        break;
      }
    }
  }
  return count / (splitArrA.length + splitArrBLength - count);
};

export const filterByDiffNgram = (
  dataList: Imemo[],
  searchStr: string,
  searchScope: 'title' | 'content' | 'all',
) => {
  return dataList.filter(memo => {
    switch (searchScope) {
      case 'title':
        return (
          diff_ngram(memo.title, searchStr, 3) >= 0.27 ||
          diff_ngram(memo.title, searchStr, 1) === 1
        );
      case 'content':
        return (
          diff_ngram(memo.title, searchStr, 3) >= 0.27 ||
          diff_ngram(memo.title, searchStr, 1) === 1
        );
      case 'all':
        return (
          diff_ngram(memo.title, searchStr, 3) >= 0.27 ||
          diff_ngram(memo.title, searchStr, 1) === 1 ||
          diff_ngram(memo.title, searchStr, 3) >= 0.27 ||
          diff_ngram(memo.title, searchStr, 1) === 1
        );
    }
  });
};

// 사용 하는 곳
const nGramMemoList = filterByDiffNgram(
  mainData.memoList,
  searchText,
  searchScope,
);

중첩 사용

정규식, 리벤슈타인거리, n-gram 순으로 가중치를 둬서 중첩사용하게 했다

const regExpMemoList = filterByRegularExpression(
  mainData.memoList,
  searchText,
  searchScope,
);
const levenshteinMemoList = filterByLevenshteinDistance(
  mainData.memoList,
  searchText,
  3,
  searchScope,
);
const nGramMemoList = filterByDiffNgram(
  mainData.memoList,
  searchText,
  searchScope,
);
const resultList = [...regExpMemoList];
for (let i = 0; i < levenshteinMemoList.length; i++) {
  if (!resultList.map(ele => ele.id).includes(levenshteinMemoList[i].id)) {
    resultList.push(levenshteinMemoList[i]);
  }
}
for (let i = 0; i < nGramMemoList.length; i++) {
  if (!resultList.map(ele => ele.id).includes(nGramMemoList[i].id)) {
    resultList.push(nGramMemoList[i]);
  }
}

<Text> 부분

하이라이팅을 적용했다

적용 방법은 하이라이팅 할 글자의 앞에 |highlight|를 붙여주고 해당 문자열 바로 뒷글자에 하이라이팅이 들어가게 했다

import React from 'react';
import {Text} from 'react-native';
import CustomText from './CustomText';
import {createFuzzyMatcher} from '~/utils/search/regularExpression';

type ModeType = 'content' | 'title' | 'explanation';
type HighlightInfo = {
  s: string;
  highlight: boolean;
};
interface IProps {
  text: string;
  style?: any;
  searchText: string;
  scope: 'title' | 'content';
  searchScope?: 'all' | 'title' | 'content';
  numberOfLines?: number;
  mode?: ModeType;
  highlightStyle?: any;
}

const changeHighlightStr = (text: string, searchText: string) => {
  // 완벽일치시 그것 먼저 색 바꿈
  if (text.indexOf(searchText) !== -1) {
    let redColor = [];
    let count = 0;
    for (let i = 0; i < text.length; i++) {
      if (i >= text.indexOf(searchText) && count !== searchText.length) {
        redColor.push(`|highlight|${text[i]}`);
        count++;
      } else redColor.push(text[i]);
    }
    return redColor.join('');
  }
  // 정규식 이용한 fuzzy검색결과 색 바꿈
  const regex = createFuzzyMatcher(searchText);
  const resultData = text.replace(regex, (match, ...groups) => {
    const letters = groups.slice(0, searchText.length);
    let lastIndex = 0;
    let redColor = [];
    for (let i = 0, l = letters.length; i < l; i++) {
      const idx = match.indexOf(letters[i], lastIndex);
      redColor.push(match.substring(lastIndex, idx));
      redColor.push(`|highlight|${letters[i]}`);
      lastIndex = idx + 1;
    }
    return redColor.join('');
  });
  if (resultData !== text) return resultData;
  // 리벤슈타인거리 이용한 검색 결과 색 바꿈
  else {
    let redColor = [];
    if (searchText === undefined) return;
    for (let i = 0; i < text.length; i++) {
      if (searchText.indexOf(text[i]) === -1) {
        redColor.push(text[i]);
      } else {
        redColor.push(`|highlight|${text[i]}`);
      }
    }
    return redColor.join('');
  }
};
const CustomHighlightedText = ({
  text,
  style,
  searchText,
  scope,
  searchScope = 'all',
  numberOfLines,
  mode,
  highlightStyle,
}: IProps) => {
  let highlightInfoList: HighlightInfo[] = [];
  if (searchScope === 'all' || searchScope === scope) {
    const highlightedStr = changeHighlightStr(text, searchText);
    if (!highlightedStr) return;
    const parts = highlightedStr.split('|highlight|');
    parts.forEach((part, index) => {
      for (let i = 0; i < part.length; i++) {
        if (i === 0 && index > 0)
          highlightInfoList.push({s: part[i], highlight: true});
        else highlightInfoList.push({s: part[i], highlight: false});
      }
    });
  } else {
    highlightInfoList = text.split('').map(ele => {
      return {s: ele, highlight: false};
    });
  }
  // 하이라이트 적용
  return (
    <Text style={style} numberOfLines={numberOfLines} ellipsizeMode="clip">
      {highlightInfoList.map((ele, index) => {
        if (ele.highlight) {
          return (
            <CustomText
              key={index}
              text={ele.s}
              style={highlightStyle}
              mode={mode}
            />
          );
        } else {
          return <CustomText key={index} text={ele.s} mode={mode} />;
        }
      })}
    </Text>
  );
};

export default CustomHighlightedText;

저작자표시

'Project > fullmemo' 카테고리의 다른 글

fullmemo 소개 (0)	2024.01.07
fullmemo 개인정보 처리방침 (0)	2023.12.15

현재글fuzzy검색

fullfish의 개발자 도전기

fuzzy검색

Regular Expression을 이용한 검색

Regular Expression을 이용한 검색

LevenshteinDistance를 이용한 fuzzy 검색

N-gram을 이용한 fuzzy검색

중첩 사용

<Text> 부분

'Project > fullmemo' 카테고리의 다른 글

'Project/fullmemo'의 다른글

티스토리툴바

fuzzy검색

Regular Expression을 이용한 검색

Regular Expression을 이용한 검색

LevenshteinDistance를 이용한 fuzzy 검색

N-gram을 이용한 fuzzy검색

중첩 사용

<Text> 부분

'Project > fullmemo' 카테고리의 다른 글

'Project/fullmemo'의 다른글

관련글

티스토리툴바