Mais c'est vrai que ça semble plus logique de convertir œ en oe.
]]>Il n'est pas dit qu'il fera toujours ça, il y a un patch en cours là-dessus qui transforme les ligatures en plusieurs lettres
https://commitfest.postgresql.org/6/301/
en rapport avec: BUG #13440: unaccent does not remove all diacritics
...et la discussion associée montre que le comportement attendu de cette fonction ne tombe pas sous le sens.
]]>set client_min_messages='WARNING';
drop database if exists test_pg_perl;
create database test_pg_perl;
\c test_pg_perl
create extension unaccent;
create extension plperl;
create extension plperlu;
create or replace function perl_unaccent(param_charset text, param_string text) returns text as $$
use Text::Unaccent;
return unac_string($_[0],$_[1]);
$$ language plperlu immutable strict;
create or replace function perl_unaccent_with_bigmap(param_string text) returns text as $$
# Based on:
# http://cpansearch.perl.org/src/PJACKLAM/Text-Unaccent-PurePerl-0.05/lib/Text/Unaccent/PurePerl.pm
# Author: Peter John Acklam
# Time-stamp: 2013-03-02 12:38:55 +00:00
# E-mail: pjacklam@online.no
# URL: http://home.online.no/~pjacklam
if (!defined $_SHARED) {
$_SHARED={
# 00A0 NO-BREAK SPACE
# -> 0020 SPACE
"\xA0" => " ",
--------------------(CUT)--------------------
# FFEE HALFWIDTH WHITE CIRCLE
# -> 25CB WHITE CIRCLE
"\x{FFEE}" => "\x{25CB}"
}
};
my $str_in=$_[0];
my $str_out='';
my $offset_max=length($str_in)-1;
for my $offset (0..$offset_max) {
my $chr=substr($str_in,$offset,1);
$str_out.=exists $_SHARED->{$chr} ? $_SHARED->{$chr} : $chr;
};
return $str_out;
$$ language plperl immutable strict;
create or replace function perl_unicode_normalize(param_string text) returns text as $$
use Unicode::Normalize;
my $w=NFD($_[0]);
$w =~ s/\pM//g; # strip combining characters
return $w;
$$ language plperlu immutable strict;
\set str2unacc '''Ça va, lʼœil de Владимир a été passé au Kärcher !'''
\echo Test sur la chaine: :str2unacc
select unaccent(:str2unacc);
select perl_unaccent('utf-8',:str2unacc);
select perl_unaccent_with_bigmap(:str2unacc);
select perl_unicode_normalize(:str2unacc);
\set loops 5000
\echo Test sur :loops appels:
explain analyze select unaccent(:str2unacc||seq) from generate_series(1,:loops) as tbl(seq);
explain analyze select perl_unaccent('utf-8',:str2unacc||seq) from generate_series(1,:loops) as tbl(seq);
explain analyze select perl_unaccent_with_bigmap(:str2unacc||seq) from generate_series(1,:loops) as tbl(seq);
explain analyze select perl_unicode_normalize(:str2unacc||seq) from generate_series(1,:loops) as tbl(seq);
Ce qui donne:
Test sur la chaine: 'Ça va, lʼœil de Владимир a été passé au Kärcher !'
unaccent
---------------------------------------------------
Ca va, lʼeil de Владимир a ete passe au Karcher !
perl_unaccent
----------------------------------------------------------------------------
Ca va, lʼÅ\u0093il de Ð\u0092ладимиÑ\u0080 a ete passe au Karcher !
perl_unaccent_with_bigmap
---------------------------------------------------
Ca va, lʼœil de Владимир a ete passe au Karcher !
perl_unicode_normalize
---------------------------------------------------
Ca va, lʼœil de Владимир a ete passe au Karcher !
(1 ligne)
Test sur 5000 appels:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series tbl (cost=0.00..20.00 rows=1000 width=4) (actual time=0.528..11.804 rows=5000 loops=1)
Total runtime: 12.073 ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series tbl (cost=0.00..267.50 rows=1000 width=4) (actual time=0.509..55.731 rows=5000 loops=1)
Total runtime: 56.385 ms
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series tbl (cost=0.00..267.50 rows=1000 width=4) (actual time=0.469..164.947 rows=5000 loops=1)
Total runtime: 165.531 ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series tbl (cost=0.00..267.50 rows=1000 width=4) (actual time=0.455..92.159 rows=5000 loops=1)
Total runtime: 92.847 ms
Si je commente le test "defined" (donc en réaffectant le tableau SHARED à chaque fois, ça repasse de 165 à ~6500ms !
]]>use Unicode::Normalize;
my $w=NFD($_[0]);
$w =~ s/\pM//g; # strip combining characters
return $w;
A confirmer par vos propres tests, mais ça me semble aussi deux fois plus rapide que unac_string() de Text::Unaccent::PurePerl
]]>Via "use Text::Unaccent::PurePerl" (plperlU) il serait initialisé une seule fois.
Pour rester en plperl (trusted), j'imagine qu'il faudrait le mettre dans $_SHARED dans une étape à part appelée une seule fois.
]]>En attendant, j'ai un peu benchmarké ce qui était dispo (à savoir l'extension unaccent, la lib perl unaccent et le source perl unaccent_pureperl, en pg9.1.16 sur une debian wheezy):
drop database if exists test_pg_perl;
create database test_pg_perl;
\c test_pg_perl
create extension unaccent;
create extension plperl;
create extension plperlu;
create or replace function perl_unaccent(param_charset text, param_string text) returns text as $$
use Text::Unaccent;
return unac_string($_[0],$_[1]);
$$ language plperlu immutable strict;
create or replace function perl_unaccent_pure(param_string text) returns text as $$
# http://cpansearch.perl.org/src/PJACKLAM/Text-Unaccent-PurePerl-0.05/lib/Text/Unaccent/PurePerl.pm
# Author: Peter John Acklam
# Time-stamp: 2013-03-02 12:38:55 +00:00
# E-mail: pjacklam@online.no
# URL: http://home.online.no/~pjacklam
my $map = {
# 00A0 NO-BREAK SPACE
# -> 0020 SPACE
"\xA0" => " ",
...
########################### bon là, j'ai viré tout le tableau associatif pour le paste...
...
# FFEE HALFWIDTH WHITE CIRCLE
# -> 25CB WHITE CIRCLE
"\x{FFEE}" => "\x{25CB}"
};
my $str_in=$_[0];
# Iterate over each character in the input string. If the character exists
# in the map, replace the current character according to the map, otherwise
# keep the character as it is.
my $str_out = '';
my $offset_max = length($str_in) - 1;
for my $offset (0 .. $offset_max) {
my $chr = substr($str_in, $offset, 1);
$str_out .= exists $map->{$chr} ? $map->{$chr} : $chr;
};
return $str_out;
$$ language plperl;
\set str2unacc '''Mon œil à été passé au Kärcher'''
\set loops 5000
explain analyze select unaccent(:str2unacc) from generate_series(1,:loops);
explain analyze select perl_unaccent('utf-8',:str2unacc) from generate_series(1,:loops);
explain analyze select perl_unaccent_pure(:str2unacc) from generate_series(1,:loops);
Voici les résultats:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..12.50 rows=1000 width=0) (actual time=0.579..7.934 rows=5000 loops=1)
Total runtime: 8.207 ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..10.00 rows=1000 width=0) (actual time=0.406..0.840 rows=5000 loops=1)
Total runtime: 1.078 ms
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series (cost=0.00..260.00 rows=1000 width=0) (actual time=2.207..7790.519 rows=5000 loops=1)
Total runtime: 7791.658 ms
Le problème a été vu dès 2006
https://rt.cpan.org/Public/Bug/Display.html?id=21177
mais l'auteur de Text::Unaccent avait apparemment déjà abandonné son package et ce rapport de bugs (et d'autres d'ailleurs) sont restés lettre morte.
Au vu de la backtrace, je pense que c'est que ça qui plante et que ça n'a rien à voir avec postgres ou plperl. Il se peut qu'avec la debian d'avant ça passait à travers le bug mais il y a un caractère aléatoire, tout dépend de ce qui se trouve en pile à cet endroit là. S'il y a une variable qui ne sert plus, ça passe. S'il y a un pointeur passé à free(), ça segfault. Or le compilateur met les variables en pile dans l'ordre qu'il veut.
Idéalement sur CPAN il faudrait qu'un autre mainteneur prenne la main et applique les patches, mais je ne crois pas que CPAN ait une politique de récupération des modules à l'abandon. Peut-être qu'au niveau de Debian?
Tu peux régler le problème au niveau de ton installation en compilant toi-même, ou sinon utiliser autre chose. Vu comment est fait Unaccent, je soupçonne que des puristes d'Unicode diraient que c'est n'importe quoi de toute manière.
]]>Éventuellement, il y a une librairie unaccent en pure-perl… ça a même l'avantage de pouvoir copier-coller tout le code dans la procédure (avec le map) et d'en faire du plperl au lieu de plperlu.
]]>#0 __GI___libc_free (mem=0x7fda00000000) at malloc.c:2929
2929 malloc.c: No such file or directory.
(gdb) bt full
#0 __GI___libc_free (mem=0x7fda00000000) at malloc.c:2929
ar_ptr = <optimized out>
p = <optimized out>
hook = 0x0
#1 0x00007fda3da30ff4 in unac_string () from /usr/lib/x86_64-linux-gnu/perl5/5.20/auto/Text/Unaccent/Unaccent.so
No symbol table info available.
#2 0x00007fda3da3027e in ?? () from /usr/lib/x86_64-linux-gnu/perl5/5.20/auto/Text/Unaccent/Unaccent.so
No symbol table info available.
#3 0x00007fda3df3da0b in Perl_pp_entersub () from /usr/lib/x86_64-linux-gnu/libperl.so.5.20
No symbol table info available.
#4 0x00007fda3df36276 in Perl_runops_standard () from /usr/lib/x86_64-linux-gnu/libperl.so.5.20
No symbol table info available.
#5 0x00007fda3debf455 in Perl_call_sv () from /usr/lib/x86_64-linux-gnu/libperl.so.5.20
No symbol table info available.
#6 0x00007fda3e2394ef in plperl_call_perl_func (desc=desc@entry=0x7fda4de0a540, fcinfo=fcinfo@entry=0x7fda4dd4ade0)
at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/pl/plperl/plperl.c:2120
sp = <optimized out>
retval = <optimized out>
i = <optimized out>
count = <optimized out>
__func__ = "plperl_call_perl_func"
#7 0x00007fda3e23e09f in plperl_func_handler (fcinfo=0x7fda4dd4ade0) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/pl/plperl/plperl.c:2310
retval = 0
prodesc = 0x7fda4de0a540
perlret = <optimized out>
rsi = 0x0
pl_error_context = {previous = 0x0, callback = 0x7fda3e233c60 <plperl_exec_callback>, arg = 0x7fda4dde2bc0}
#8 plperl_call_handler (fcinfo=0x7fda4dd4ade0) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/pl/plperl/plperl.c:1757
save_exception_stack = 0x7ffda8a45690
save_context_stack = 0x0
local_sigjmp_buf = {{__jmpbuf = {140575585381856, 1075903644315876123, 140727432795168, 140575585381248, 140575585381744, 140575585381904, 1075903644504619803,
1054209159195589403}, __mask_was_saved = 0, __saved_mask = {__val = {16384, 1, 0, 140575585349048, 0, 140727432795088, 140575557409410, 2, 0, 140575584584928,
13924289827103274240, 25, 13924289827103274240, 16389, 13924289827103274240, 140575584823600}}}}
retval = <optimized out>
save_call_data = 0x0
oldinterp = 0x0
this_call_data = {prodesc = 0x7fda4de0a540, fcinfo = 0x7fda4dd4ade0, tuple_store = 0x0, ret_tdesc = 0x0, tmp_cxt = 0x0}
#9 0x00007fda4c353743 in ExecMakeFunctionResultNoSets (fcache=0x7fda4dd4ad70, econtext=0x7fda4dd4ab80, isNull=0x7fda4dd4b7c8 "", isDone=<optimized out>)
at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/executor/execQual.c:2026
arg = <optimized out>
result = <optimized out>
fcinfo = 0x7fda4dd4ade0
fcusage = {fs = 0x0, save_f_total_time = {tv_sec = 140575585381744, tv_usec = 140575584823712}, save_total = {tv_sec = 140575585381248, tv_usec = 140575585384392}, f_start = {
tv_sec = 140575585384672, tv_usec = 140727432795280}}
i = <optimized out>
#10 0x00007fda4c3597ad in ExecTargetList (isDone=0x7ffda8a45524, itemIsDone=0x7fda4dd4b8e0, isnull=0x7fda4dd4b7c8 "", values=0x7fda4dd4b7b0, econtext=0x7fda4dd4ab80,
targetlist=0x7fda4dd4b8b0) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/executor/execQual.c:5307
gstate = <optimized out>
tle = <optimized out>
resind = <optimized out>
---Type <return> to continue, or q <return> to quit---
tl = 0x7fda4dd4b890
haveDoneSets = 0 '\000'
#11 ExecProject (projInfo=<optimized out>, isDone=isDone@entry=0x7ffda8a45524) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/executor/execQual.c:5522
slot = 0x7fda4dd4ac60
econtext = 0x7fda4dd4ab80
numSimpleVars = <optimized out>
#12 0x00007fda4c36bdf8 in ExecResult (node=node@entry=0x7fda4dd4aa70) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/executor/nodeResult.c:155
outerTupleSlot = <optimized out>
resultSlot = <optimized out>
outerPlan = <optimized out>
econtext = 0x7fda4dd4ab80
isDone = ExprSingleResult
#13 0x00007fda4c3525f8 in ExecProcNode (node=node@entry=0x7fda4dd4aa70) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/executor/execProcnode.c:373
result = <optimized out>
__func__ = "ExecProcNode"
#14 0x00007fda4c34f80e in ExecutePlan (dest=0x7fda4dd0bee0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, planstate=0x7fda4dd4aa70,
estate=0x7fda4dd4a960) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/executor/execMain.c:1490
slot = <optimized out>
current_tuple_count = 0
#15 standard_ExecutorRun (queryDesc=0x7fda4dd4a550, direction=<optimized out>, count=0) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/executor/execMain.c:319
estate = 0x7fda4dd4a960
operation = CMD_SELECT
dest = 0x7fda4dd0bee0
sendTuples = <optimized out>
#16 0x00007fda4c44b23f in PortalRunSelect (portal=portal@entry=0x7fda4dd48540, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x7fda4dd0bee0)
at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/tcop/pquery.c:946
queryDesc = 0x7fda4dd4a550
direction = <optimized out>
nprocessed = <optimized out>
__func__ = "PortalRunSelect"
#17 0x00007fda4c44c8b0 in PortalRun (portal=portal@entry=0x7fda4dd48540, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x7fda4dd0bee0,
altdest=altdest@entry=0x7fda4dd0bee0, completionTag=completionTag@entry=0x7ffda8a45ac0 "") at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/tcop/pquery.c:790
save_exception_stack = 0x7ffda8a45970
save_context_stack = 0x0
local_sigjmp_buf = {{__jmpbuf = {140575585122216, 1075903644334750491, 140575585371456, 140575585124064, 140575585122488, 2, 1075903644378790683, 1054310065454381851},
__mask_was_saved = 0, __saved_mask = {__val = {140727432795975, 7124309760, 140575560164752, 140575561544582, 64, 140727432795952, 88, 140575585371456, 140575561268835,
140575585122488, 2, 140727432795984, 140575560255786, 2, 140575585371456, 140727432796016}}}}
result = <optimized out>
nprocessed = <optimized out>
saveTopTransactionResourceOwner = 0x7fda4dcfef50
saveTopTransactionContext = 0x7fda4dcfee40
saveActivePortal = 0x0
saveResourceOwner = 0x7fda4dcfef50
savePortalContext = 0x0
saveMemoryContext = 0x7fda4dcfee40
__func__ = "PortalRun"
#18 0x00007fda4c449f8b in exec_simple_query (query_string=0x7fda4dd0ab40 "SELECT unaccent('utf-8','héhé');") at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/tcop/postgres.c:1072
parsetree = 0x7fda4dd0b7a8
portal = 0x7fda4dd48540
snapshot_set = <optimized out>
---Type <return> to continue, or q <return> to quit---
commandTag = <optimized out>
completionTag = "\000%\312M\332\177\000\000\240w\310M\332\177\000\000Hx\310M\332\177\000\000P\000\000\000\000\000\000\000\250%\312M\332\177\000\000\342%\312M\332\177\000\000\020[\244\250\375\177\000\000\374\251\070L\332\177\000"
querytree_list = <optimized out>
plantree_list = 0x7fda4dcc2c10
receiver = 0x7fda4dd0bee0
format = 0
dest = DestRemote
parsetree_list = 0x7fda4dd0b8d8
save_log_statement_stats = 0 '\000'
was_logged = 0 '\000'
msec_str = "\200[\244\250\375\177\000\000\"\001\000\000\000\000\000\000(g\244\250\375\177\000\000\300f\314M\332\177\000"
parsetree_item = 0x7fda4dd0b8b8
isTopLevel = 1 '\001'
#19 PostgresMain (argc=<optimized out>, argv=argv@entry=0x7fda4dc87880, dbname=0x7fda4dc87730 "postgres", username=<optimized out>)
at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/tcop/postgres.c:4074
query_string = 0x7fda4dd0ab40 "SELECT unaccent('utf-8','héhé');"
firstchar = 1305773376
input_message = {data = 0x7fda4dd0ab40 "SELECT unaccent('utf-8','héhé');", len = 35, maxlen = 1024, cursor = 35}
local_sigjmp_buf = {{__jmpbuf = {140727432796432, 1075903644735306523, 1, 140575584581392, 140575584839360, 140575564674304, 1075903644332653339, 1054310067531086619},
__mask_was_saved = 1, __saved_mask = {__val = {0, 4294967295, 140575529333856, 0, 140575529333856, 0, 140727432796672, 18446744073709551615, 0, 4294967295, 0, 140578574565375,
140575564145920, 0, 140575529333856, 0}}}}
send_ready_for_query = 0 '\000'
__func__ = "PostgresMain"
#20 0x00007fda4c211829 in BackendRun (port=0x7fda4dcc66c0) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/postmaster/postmaster.c:4164
ac = 1
secs = 493483548
usecs = 296945
i = 1
av = 0x7fda4dc87880
maxac = <optimized out>
#21 BackendStartup (port=0x7fda4dcc66c0) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/postmaster/postmaster.c:3829
bn = <optimized out>
pid = <optimized out>
#22 ServerLoop () at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/postmaster/postmaster.c:1597
rmask = {fds_bits = {256, 0 <repeats 15 times>}}
selres = <optimized out>
readmask = {fds_bits = {448, 0 <repeats 15 times>}}
now = <optimized out>
last_touch_time = 1440168308
__func__ = "ServerLoop"
#23 0x00007fda4c3f38ae in PostmasterMain (argc=5, argv=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/postmaster/postmaster.c:1244
opt = <optimized out>
status = <optimized out>
userDoption = <optimized out>
listen_addr_saved = 1 '\001'
i = <optimized out>
output_config_variable = <optimized out>
__func__ = "PostmasterMain"
#24 0x00007fda4c212998 in main (argc=5, argv=0x7fda4dc86540) at /tmp/buildd/postgresql-9.4-9.4.3/build/../src/backend/main/main.c:228
Détail amusant, il ne semble pas y avoir malloc.c dans les libs de debug debian.
]]>