本文共 27055 字,大约阅读时间需要 90 分钟。
作者
digoal日期
2017-04-26标签
PostgreSQL , 模糊查询 , 正则查询 , pg_trgm , bytea , gin , 函数索引背景
前模糊(有前缀的模糊),后模糊(有后缀的模糊),前后模糊(无前后缀的模糊),正则匹配都属于文本搜索领域常见的需求。PostgreSQL拥有很强的文本搜索能力,除了支持全文检索,还支持模糊查询、正则查询。内置的pg_trgm插件是一般数据库没有的,可能很多人没有听说过。同时还内置了表达式索引、GIN索引的功能。
不同的模糊查询需求,有不同的优化方法。
对于前模糊和后模糊,PostgreSQL则与其他数据库一样,可以使用btree来加速。后模糊可以使用反转函数的函数索引来加速。
对于前后模糊和正则匹配,一种方法是使用pg_trgm插件,利用GIN索引加速模糊和正则查询(输入3个或3个以上字符的模糊查询效果很好)。另一种方法是自定义GIN表达式索引的方法,适合于定制的模糊查询。
使用b-tree可以支持前模糊的查询。
1.1 当使用类型默认的index ops class时,仅适合于collate="C"的查询(当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate "C")。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create table test(id int, info text); CREATE TABLE test=# insert into test select generate_series(1,1000000),md5(random()::text); INSERT 0 1000000 test=# create index idx on test(info collate "C"); CREATE INDEX test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C"; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1) Output: id, info Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text)) Filter: (test.info ~~ 'abcd%'::text COLLATE "C") Buffers: shared hit=18 read=3 Planning time: 0.424 ms Execution time: 0.124 ms (7 rows)
1.2 当数据库默认的lc_collate<>C时,还有一种方法让b-tree索引支持模糊查询。使用对应类型的pattern ops,使用pattern ops将使用字符的查询方式而非binary的搜索方式。
文档中有如下解释
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.
例子
test=# drop table test; DROP TABLE test=# create table test(id int, info text); CREATE TABLE test=# insert into test select generate_series(1,1000000),md5(random()::text); INSERT 0 1000000 test=# create index idx on test(info text_pattern_ops); CREATE INDEX test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "zh_CN"; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1) Output: id, info Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text)) Filter: (test.info ~~ 'abcd%'::text COLLATE "zh_CN") Buffers: shared hit=12 read=3 Planning time: 0.253 ms Execution time: 0.081 ms (7 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C"; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1) Output: id, info Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text)) Filter: (test.info ~~ 'abcd%'::text COLLATE "C") Buffers: shared hit=15 Planning time: 0.141 ms Execution time: 0.072 ms (7 rows)
使用类型对应的pattern ops,索引搜索不仅支持LIKE的写法,还支持规则表达式的写法,如下:
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '^abcd'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1) Output: id, info Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text)) Filter: (test.info ~ '^abcd'::text) Buffers: shared hit=15 Planning time: 0.213 ms Execution time: 0.083 ms (7 rows)
使用反转函数(reverse)索引,可以支持后模糊的查询。
2.1 当使用类型默认的index ops class时,仅适合于collate="C"的查询(当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate "C")。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create index idx1 on test(reverse(info) collate "C"); CREATE INDEX test=# select * from test limit 1; id | info ----+---------------------------------- 1 | b3275976cdd437a033d4329775a52514 (1 row) test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C"; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1) Output: id, info Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text)) Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C") Buffers: shared hit=18 read=3 Planning time: 0.128 ms Execution time: 0.122 ms (7 rows) test=# select * from test where reverse(info) like '4152%' collate "C"; id | info --------+---------------------------------- 847904 | abe2ecd90393b5275df8e34a39702514 414702 | 97f66d26545329321164042657d02514 191232 | 7820972c6220c2b01d46c11ebb532514 752742 | 93232ac39c6632e2540df44627c42514 217302 | 39e518893a1a7b1e691619bd1fc42514 1 | b3275976cdd437a033d4329775a52514 615718 | 4948f94c484c13dc6c4fae8a3db52514 308815 | fc2918ceff7c7a4dafd2e04031062514 149521 | 546d963842ea5ca593e622c810262514 811093 | 4b6eca2eb6b665af67b2813e91a62514 209000 | 1dfd0d4e326715c1739f031cca992514 937616 | 8827fd81f5b673fb5afecbe3e11b2514 419553 | bd6e01ce360af16137e8b6abc8ab2514 998324 | 7dff51c19dc5e5d9979163e7d14c2514 771518 | 8a54e30003a48539fff0aedc73ac2514 691566 | f90368348e3b6bf983fcbe10db2d2514 652274 | 8bf4a97b5f122a5540a21fa85ead2514 233437 | 739ed715fc203d47e37e79b5bcbe2514 (18 rows)
2.2 当数据库默认的lc_collate<>C时,还有一种方法让b-tree索引支持模糊查询。使用对应类型的pattern ops,使用pattern ops将使用字符的查询方式而非binary的搜索方式。
使用类型对应的pattern ops,索引搜索不仅支持LIKE的写法,还支持规则表达式的写法。
例子
test=# create index idx1 on test(reverse(info) text_pattern_ops); CREATE INDEX test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.026..0.049 rows=12 loops=1) Output: id, info Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text)) Filter: (reverse(test.info) ~~ '4152%'::text) Buffers: shared hit=15 Planning time: 0.102 ms Execution time: 0.072 ms (7 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) ~ '^4152'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.031..0.063 rows=12 loops=1) Output: id, info Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text)) Filter: (reverse(test.info) ~ '^4152'::text) Buffers: shared hit=15 Planning time: 0.148 ms Execution time: 0.087 ms (7 rows)
使用pg_trgm索引,可以支持前、后模糊的查询。
注意:
(有前缀的模糊)至少输入1个字符,(有后缀的模糊)至少输入2个字符,才有好的索引过滤效果。
如果要高效支持多字节字符(例如中文),数据库lc_ctype不能为"C",只有TOKEN分割正确效果才是OK的。(lc_ctype设置正确,才能够正确的逐一分割多字节字符串中的文字: LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。
索引、查询条件的collate必须一致才能使用索引。
test=# \l+ test List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description ------+----------+----------+------------+------------+-------------------+--------+------------+------------- test | postgres | UTF8 | zh_CN.utf8 | zh_CN.utf8 | | 245 MB | pg_default | (1 row) test=# create extension pg_trgm; test=# create table test001(c1 text); CREATE TABLE
生成随机中文字符串的函数
test=# create or replace function gen_hanzi(int) returns text as $$ declare res text; begin if $1 >=1 then select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1); return res; end if; return null; end; $$language plpgsql strict; CREATE FUNCTION
生成随机数据
test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000); INSERT 0 100000 test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops); CREATE INDEX test=# select * from test001 limit 5; c1 ------------------------------------------ 埳噪办甾讷昃碇玾陧箖燋邢賀浮媊踮菵暔谉橅 秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅 蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓 馳泅薬鐗愅撞窍浉渗蛁灎厀攚摐瞪拡擜詜隝緼 襳铺煃匶瀌懲荼黹樆惺箧搔羾憯墆鋃硍蔓恧顤 (5 rows)
模糊查询
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '你%'::text) Heap Blocks: exact=3 Buffers: shared hit=7 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1) Index Cond: (test001.c1 ~~ '你%'::text) Buffers: shared hit=4 Planning time: 0.119 ms Execution time: 0.063 ms (10 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%恧顤'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%恧顤'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=2 Buffers: shared hit=6 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1) Index Cond: (test001.c1 ~~ '%恧顤'::text) Buffers: shared hit=4 Planning time: 0.136 ms Execution time: 0.062 ms (11 rows)
使用pg_trgm插件,支持前后模糊的查询。
注意:
如果要让pg_trgm高效支持多字节字符(例如中文),数据库lc_ctype不能为"C",只有TOKEN分割正确效果才是OK的。(lc_ctype设置正确,才能够正确的逐一分割多字节字符串中的文字: Character classification (What is a letter? Its upper-case equivalent?))。
建议输入3个或3个以上字符,否则效果不佳(后面会分析原因)。
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢賀%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%燋邢賀%'::text) Heap Blocks: exact=1 Buffers: shared hit=5 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1) Index Cond: (test001.c1 ~~ '%燋邢賀%'::text) Buffers: shared hit=4 Planning time: 0.170 ms Execution time: 0.076 ms (10 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢%'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%燋邢%'::text) Rows Removed by Index Recheck: 99999 Heap Blocks: exact=1137 Buffers: shared hit=14429 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1) Index Cond: (test001.c1 ~~ '%燋邢%'::text) Buffers: shared hit=13292 Planning time: 0.133 ms Execution time: 178.265 ms (11 rows)
PostgreSQL 正则匹配的语法为 字符串 ~ 'pattern' 或 字符串 ~* 'pattern'
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '12[0-9]{3,9}'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on public.test001 (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1) Output: c1 Recheck Cond: (test001.c1 ~ '12[0-9]{3,9}'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=1 Buffers: shared hit=50 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1) Index Cond: (test001.c1 ~ '12[0-9]{3,9}'::text) Buffers: shared hit=49 Planning time: 0.452 ms Execution time: 0.221 ms (11 rows) test01=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '宸朾啣' collate "zh_CN"; QUERY PLAN -----------------------------------------------------------------------------------------------------------------------Bitmap Heap Scan on public.test001 (cost=6.58..19.42 rows=10 width=61) (actual time=0.061..0.061 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN") Heap Blocks: exact=1 Buffers: shared hit=5 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..6.58 rows=10 width=0) (actual time=0.049..0.049 rows=1 loops=1) Index Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN") Buffers: shared hit=4Planning time: 0.238 msExecution time: 0.082 ms(10 rows)
正则匹配索引原理参考contrib/pg_trgm/trgm_regexp.c
pg_trgm模糊查询的原理
首先,pg_trgm将字符串的前端添加2个空格,末尾添加1个空格。然后,每连续的3个字符为一个TOKEN,拆开。
最后,对TOKEN建立GIN倒排索引。
查看字符串的TOKEN,可以使用如下方法。
test=# select show_trgm('123'); show_trgm ------------------------- {" 1"," 12",123,"23 "} (1 row)
pg_trgm前后模糊字符个数要求的原因
使用pg_trgm时,如果要获得最好的效果,最好满足这些条件。原因是什么呢?
因为pg_trgm生成的TOKEN是三个字符,只有在以上三个条件下,才能匹配到对应的TOKEN。
test=# select show_trgm('123'); show_trgm ------------------------- {" 1"," 12",123,"23 "} (1 row)
当需要前后模糊搜索1个或者2个字符时,pg_trgm无法满足需求,但是我们可以使用表达式GIN索引。
使用表达式,将字符串拆成1个单字,两个连续的字符的数组,对数组建立GIN索引即可。
例子
test=# create or replace function split001(text) returns text[] as $$ declare res text[]; begin select regexp_split_to_array($1,'') into res; for i in 1..length($1)-1 loop res := array_append(res, substring($1,i,2)); end loop; return res; end; $$language plpgsql strict immutable; CREATE FUNCTION test=# create index idx_test001_2 on test001 using gin (split001(c1)); test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你好']; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.041..0.041 rows=0 loops=1) Output: c1 Recheck Cond: (split001(test001.c1) @> '{你好}'::text[]) Buffers: shared hit=4 -> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.039..0.039 rows=0 loops=1) Index Cond: (split001(test001.c1) @> '{你好}'::text[]) Buffers: shared hit=4 Planning time: 0.104 ms Execution time: 0.068 ms (9 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array['你']; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=8.87..550.12 rows=500 width=61) (actual time=0.063..0.183 rows=86 loops=1) Output: c1 Recheck Cond: (split001(test001.c1) @> '{你}'::text[]) Heap Blocks: exact=80 Buffers: shared hit=84 -> Bitmap Index Scan on idx_test001_2 (cost=0.00..8.75 rows=500 width=0) (actual time=0.048..0.048 rows=86 loops=1) Index Cond: (split001(test001.c1) @> '{你}'::text[]) Buffers: shared hit=4 Planning time: 0.101 ms Execution time: 0.217 ms (10 rows) test=# select * from test001 where split001(c1) @> array['你']; c1 ------------------------------------------ 殐踨洪冨垓丩贤閚偉垢胸鍘崩你萭隡劭芛雫袰 靅慨热脸罆淓寘鰻总襎戍謸枨陪丼倫柆套你仮 ......
模糊查询和正则匹配都是找出完全符合条件的记录,还有一种需求是相似查询。
例如postgresql字符串,输入 p0stgresgl 也能根据相似度匹配到。
这里同样用到了pg_trgm插件,如果要支持中文,同样有这样的要求:
如果需要让pg_trgm支持中文相似查询,数据库lc_ctype不能为"C",只有TOKEN分割正确效果才是OK的。(lc_ctype设置正确,才能够正确的逐一分割多字节字符串中的文字: Character classification (What is a letter? Its upper-case equivalent?))。
建议输入3个或3个以上字符,否则效果不佳(后面会分析原因)。
例子
test=# create index idx_test001_3 on test001 using gist (c1 gist_trgm_ops); CREATE INDEX test=# explain (analyze,verbose,timing,costs,buffers) SELECT t, c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist FROM test001 t ORDER BY dist LIMIT 5; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.28..0.52 rows=5 width=89) (actual time=37.462..37.639 rows=5 loops=1) Output: t.*, ((c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text)) Buffers: shared hit=1631 -> Index Scan using idx_test001_3 on public.test001 t (cost=0.28..4763.28 rows=100000 width=89) (actual time=37.461..37.636 rows=5 loops=1) Output: t.*, (c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text) Order By: (t.c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡'::text) Buffers: shared hit=1631 Planning time: 0.089 ms Execution time: 37.668 ms (9 rows) test=# SELECT t, c1 <-> '癷磛鹚蠌鳃蠲123鶡埀婎鳊苿奶垨惸溴蔻筴熝憡' AS dist FROM test001 t ORDER BY dist LIMIT 5; t | dist --------------------------------------------+---------- (癷磛鹚蠌鳃蠲你鶡埀婎鳊苿奶垨惸溴蔻筴熝憡) | 0.307692 (坆桻悁斾耾瑚豌腏炁悿隖轲盃挜稐睟礓蜮铅湆) | 0.976744 (癷鉜餯祂鼃恫蝅瓟顡廕梍蛸歡僷贊敔欓侑韌鐹) | 0.976744 (癷嚯鳬戚蹪熼胘檙佌欔韜挹樷覄惶蹝顼鑜鞖媗) | 0.976744 (癷饎瞲餿堒歃峽盾豼擔禞嵪豦咢脉馄竨济隘缄) | 0.976744 (5 rows)
1亿条记录,每条记录15个随机中文。测试前后模糊查询性能。
vi test.sql insert into test001 select gen_hanzi(15) from generate_series(1,2500000); pgbench -n -r -P 1 -f ./test.sql -c 40 -j 40 -t 1 test test=# select count(*) from test001; count ----------- 100000000 (1 row) test=# select * from test001 limit 10; c1 -------------------------------- 釾笉皜鰈确艄騚馺腃彊釲忰采汦擇 槮搮圮墔婂蹾飘孡鶒镇赀聵線麯櫕 孨鄈韞萅赫炧暤蟠檼駧餪崉娲譌筯 烸喖醝稦怩鷟棾奜妛曫仾飛饡绘韦 撑豁襉峊炠眏罱襄彊鰮莆壏妒辷阛 蜁愊鶱磹贰帵眲嚉榑苍潵檐簄椰魨 瑄翁蠃巨躋壾蛸湗鑂顂櫟砣八癱栵 馇巍笿鞒装棊嘢恓煓熴锠鋈蹃煿屓 訆韄踔牤嘇糺絢軿鵑燿螛梋鰢謇郼 撲蓨伤釱糕觩嬖蓷鰼繩圆醷熌靉掑 (10 rows)
创建索引
test=# set maintenance_work_mem ='32GB'; test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
表和索引大小
test=# \di+ List of relations Schema | Name | Type | Owner | Table | Size | Description --------+---------------+-------+----------+---------+-------+------------- public | idx_test001_1 | index | postgres | test001 | 12 GB | (1 row) test=# \dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+---------+-------+----------+---------+------------- public | test001 | table | postgres | 7303 MB | (1 row)
3.1 前模糊
响应时间:9毫秒
返回4701行
select * from test001 where c1 like '你%'; test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=1.546..8.868 rows=4701 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '你%'::text) Rows Removed by Index Recheck: 85 Heap Blocks: exact=4776 Buffers: shared hit=4784 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.879..0.879 rows=4786 loops=1) Index Cond: (test001.c1 ~~ '你%'::text) Buffers: shared hit=8 Planning time: 0.099 ms Execution time: 9.166 ms (11 rows)
3.2 后模糊
响应时间:0.25毫秒
返回2行
select * from test001 where c1 like '%靉掑'; test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%靉掑'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.049..0.223 rows=2 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%靉掑'::text) Rows Removed by Index Recheck: 87 Heap Blocks: exact=89 Buffers: shared hit=94 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.031..0.031 rows=89 loops=1) Index Cond: (test001.c1 ~~ '%靉掑'::text) Buffers: shared hit=5 Planning time: 0.113 ms Execution time: 0.249 ms (11 rows)
3.3 前后模糊
响应时间:0.2毫秒
返回1行
select * from test001 where c1 like '%螛梋鰢%'; test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%螛梋鰢%'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.044..0.175 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%螛梋鰢%'::text) Rows Removed by Index Recheck: 81 Heap Blocks: exact=82 Buffers: shared hit=87 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..87.00 rows=10000 width=0) (actual time=0.027..0.027 rows=82 loops=1) Index Cond: (test001.c1 ~~ '%螛梋鰢%'::text) Buffers: shared hit=5 Planning time: 0.112 ms Execution time: 0.201 ms (11 rows)
三字或以上模糊查询,使用pg_trgm可以很好的解决。
1,2个字的模糊查询,使用表达式索引也可以实现很好的性能。
postgres=# create or replace function split_12(text) returns text[] as $$declare res text[];begin select regexp_split_to_array($1, '') into res; for i in 1..length($1)-1 loop res := array_append(res, substring($1, i, 2)); end loop; return res;end;$$ language plpgsql strict immutable;CREATE FUNCTIONpostgres=# select split_12('abc你好'); split_12 ------------------------------ {a,b,c,你,好,ab,bc,c你,你好}(1 row)create index idx2 on tbl using gin (split_12(col)); select * from tbl where split_12(col) @> array['单字或双字'];
建议应用在查询的时候,判断一下词的个数,选择正确的SQL写法。
转载地址:http://lhuyo.baihongyu.com/