PHP preg_match and UTF-8

Contents

preg_match and UTF-8 in PHP
7 answers
Pattern Modifiers
User Contributed Notes 12 notes

preg_match and UTF-8 in PHP

Matching /H/u against "¡Hola!" with PREG_OFFSET_CAPTURE and printing $a_matches[0][1] should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems that the subject is not being treated as a UTF-8 encoded string, even though I am passing the "u" modifier in the regular expression. I have the following settings in my php.ini, and other UTF-8 functions work:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

7 answers

The 'u' switch only has meaning for pcre; PHP itself does not know about it.

From PHP's point of view, strings are byte sequences, and the returned offset seems logical (I'm not saying "correct").

Keep in mind that the same "rules" regarding UTF-8 handling apply to the 5th parameter, $offset. Example: var_dump(preg_match('/#/u', "\xc3\xa4#", $matches, 0, 2));
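The point about $offset can be checked with a short sketch. It assumes a PHP build whose PCRE library validates the start offset in UTF-8 mode; the variable names are illustrative only.

```php
<?php
// "\xc3\xa4" is "ä" (2 bytes in UTF-8), so "#" sits at byte offset 2
// but at character offset 1.
$subject = "\xc3\xa4#";

// Starting the search at byte offset 2 lands exactly on "#".
$atByteTwo = preg_match('/#/u', $subject, $m, PREG_OFFSET_CAPTURE, 2);
$foundAt = $m[0][1]; // offset of the match, counted in bytes

// Byte offset 1 points into the middle of "ä". With the u modifier this
// is not a valid starting position, so the call fails instead of matching.
$atByteOne = preg_match('/#/u', $subject, $m2, 0, 1);

var_dump($atByteTwo, $foundAt, $atByteOne);
```

So both the offset passed in and the offsets returned are byte positions, even in UTF-8 mode.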

PHP does know about the u modifier, which is listed in the manual; see "u (PCRE_UTF8)" at php.net/manual/en/reference.pcre.pattern.modifiers.php

Although the u modifier makes both the pattern and the subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen to get the length in UTF-8 characters rather than in bytes:

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
// Cut the subject at the byte offset, then count the characters before it.
echo mb_strlen(substr($str, 0, $a_matches[0][1]));
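The same idea can be wrapped in a small helper. This is only a sketch: byteOffsetToCharOffset is a hypothetical name, it assumes the subject is valid UTF-8 and that the mbstring extension is loaded, and it passes the encoding explicitly so the result does not depend on mbstring.internal_encoding.

```php
<?php
/**
 * Hypothetical helper: converts a byte offset reported by preg_match()
 * into a character offset, assuming $subject is valid UTF-8.
 */
function byteOffsetToCharOffset(string $subject, int $byteOffset): int
{
    // Plain substr() cuts by bytes (assuming mbstring.func_overload is
    // not in effect); mb_strlen() then counts the characters in that prefix.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

$str = "\xC2\xA1Hola!"; // "¡Hola!"
preg_match('/H/u', $str, $matches, PREG_OFFSET_CAPTURE);

$byteOffset = $matches[0][1];
$charOffset = byteOffsetToCharOffset($str, $byteOffset);

echo "byte offset: $byteOffset, character offset: $charOffset\n";
```

Counting the prefix is linear in the offset, so for many matches in a long string a precomputed map may be a better fit.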

"The u modifier is only for interpreting the pattern as UTF-8, not the subject." That is not true. Compare, for example, preg_split('//', …) with preg_split('//u', …). Since "x is interpreted as UTF-8" is somewhat vague, it is worth looking at the actual effects of Unicode mode.
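A minimal sketch of that comparison, using the "¡Hola!" string from the question as the subject (the choice of subject and the PREG_SPLIT_NO_EMPTY flag are assumptions made for the example):

```php
<?php
$str = "\xC2\xA1Hola!"; // "¡Hola!" in UTF-8: 6 characters, 7 bytes

// Without u the empty pattern matches between every pair of bytes,
// so the two bytes of "¡" end up in separate pieces.
$byteWise = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY);

// With u the subject is read as UTF-8, so the split happens between
// characters and "¡" stays intact.
$charWise = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo count($byteWise), " pieces without u\n";
echo count($charWise), " pieces with u\n";
```

The difference in piece counts shows that the modifier changes how the subject is read, not only the pattern.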

@pathros It's 2016 now, and PHP still sucks abysmally at Unicode, and PHP 7 changed nothing in this regard.

a·bys·mal·ly /əˈbizməlē/ (adverb): very badly; appallingly. Also, @TomaszKowalczyk, it's 2017 and PHP still sucks abysmally at Unicode.

@tomalak and those who followed. Of course PHP does not handle Unicode if you use the old functions such as substr, strlen, etc., because they work with bytes. But Unicode has been fully handled for a very long time through the mbstring extension, which is enabled by default in many distributions and on many servers. This is a choice made to maintain backward compatibility.

I have had NO PROBLEMS with UTF-8 in PHP since I started converting all my old sites to Unicode 4-5 years ago.

One of the greatest answers on SO. I spent quite a lot of time tearing my hair out. Thank you very much!

@Tomalak "Dude, it's 2019 now and PHP still sucks abysmally at Unicode." Please confirm.

Try adding (*UTF8) before the regular expression:

preg_match('(*UTF8)/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

Interesting, although I think you need the initial / before (*UTF8). This doesn't work on my system, but it may on others. What do you get when you do echo $a_matches[0][1]; ?

I used it like this in PHP 5.4.29, works like a charm: preg_match_all('/(*UTF8)[^A-Za-z0-9\s]/', $txt, $matches);

It doesn't work for me on either PHP 5.6 or PHP 7 on Ubuntu 16.04. (*UTF8) before the delimiter is an error; after the delimiter it has no effect. I suspect it depends on how and where you got your PHP, in particular on the settings libpcre* was compiled with.

It doesn't change the offsets for me, but it's interesting to know. The original documentation for this "feature": pcre.org/pcre.txt

Excuse me for necroposting, but maybe someone will find this useful: the code below can work as a replacement for the preg_match and preg_match_all functions and returns the matches with offsets that are correct for UTF-8 encoded strings.

mb_internal_encoding('UTF-8');

/**
 * Returns array of matches in same format as preg_match or preg_match_all
 * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
 * @param string $pattern  The pattern to search for, as a string.
 * @param string $subject  The input string.
 * @param int    $offset   The place from which to start the search (in bytes).
 * @return array
 */
function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
{
    $matchInfo = array();
    $method    = 'preg_match';
    $flag      = PREG_OFFSET_CAPTURE;
    if ($matchAll) {
        $method .= '_all';
    }
    $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
    $result = array();
    if ($n !== 0 && !empty($matchInfo)) {
        if (!$matchAll) {
            $matchInfo = array($matchInfo);
        }
        foreach ($matchInfo as $matches) {
            $positions = array();
            foreach ($matches as $match) {
                $matchedText   = $match[0];
                $matchedLength = $match[1]; // byte offset of the match
                $positions[]   = array(
                    $matchedText,
                    // Count the characters that precede the byte offset.
                    mb_strlen(mb_strcut($subject, 0, $matchedLength))
                );
            }
            $result[] = $positions;
        }
        if (!$matchAll) {
            $result = $result[0];
        }
    }
    return $result;
}

$s1 = 'Попробуем русскую строку для теста';
$s2 = 'Try english string for test';

var_dump(pregMatchCapture(true, '/обу/', $s1));
var_dump(pregMatchCapture(false, '/обу/', $s1));

var_dump(pregMatchCapture(true, '/lish/', $s2));
var_dump(pregMatchCapture(false, '/lish/', $s2));

array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "обу"
      [1]=>
      int(4)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(6) "обу"
    [1]=>
    int(4)
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(4) "lish"
      [1]=>
      int(7)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(4) "lish"
    [1]=>
    int(7)
  }
}

Can you explain what your code does instead of just pasting a code dump? And how does it answer the question?

It does exactly what is described in the comments and returns the CORRECT string offsets. That is the subject of the question. No idea why my answer got -2. It works for me.

Well, that is why you should include an explanation of what your code does. People don't understand what you are trying to do here.

I wrote a small class to convert the offsets returned by preg_match into the corresponding UTF offsets:

final class NonUtfToUtfOffset
{
    /** @var int[] */
    private $utfMap = [];

    public function __construct(string $content)
    {
        $contentLength = mb_strlen($content);

        for ($offset = 0; $offset < $contentLength; $offset ++) {
            $char = mb_substr($content, $offset, 1);
            $nonUtfLength = strlen($char);

            // Map every byte of this character to its character offset.
            for ($charOffset = 0; $charOffset < $nonUtfLength; $charOffset ++) {
                $this->utfMap[] = $offset;
            }
        }
    }

    public function convertOffset(int $nonUtfOffset): int
    {
        return $this->utfMap[$nonUtfOffset];
    }
}

You can use it like this:

$content = 'aą bać d';
$offsetConverter = new NonUtfToUtfOffset($content);

preg_match_all('#(bać)#ui', $content, $m, PREG_OFFSET_CAPTURE);

foreach ($m[1] as [$word, $offset]) {
    echo "bad: " . mb_substr($content, $offset, mb_strlen($word))."\n";
    echo "good: " . mb_substr($content, $offsetConverter->convertOffset($offset), mb_strlen($word))."\n";
}

Source

Pattern Modifiers

The current possible PCRE modifiers are listed below. The names in parentheses refer to internal PCRE names for these modifiers. Spaces and newlines are ignored in modifiers, other characters cause error.

i (PCRE_CASELESS)
If this modifier is set, letters in the pattern match both upper and lower case letters.

m (PCRE_MULTILINE)
By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl. When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

s (PCRE_DOTALL)
If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

x (PCRE_EXTENDED)
If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include commentary inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.

A (PCRE_ANCHORED)
If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the string which is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.

D (PCRE_DOLLAR_ENDONLY)
If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.

S
When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY)
This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).

Note:

It is usually not possible to match more than pcre.backtrack_limit characters in ungreedy mode.

X (PCRE_EXTRA)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. There are at present no other features controlled by this modifier.

J (PCRE_INFO_JCHANGED)
The (?J) internal option setting changes the local PCRE_DUPNAMES option. Allow duplicate names for subpatterns. As of PHP 7.2.0 J is supported as modifier as well.

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid.
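A few of the modifiers described above can be compared side by side. This is only an illustrative sketch: the patterns and subjects are made up for the example, and the last check assumes a PHP version in which preg_last_error() reports PREG_BAD_UTF8_ERROR for an invalid subject.

```php
<?php
// i: letters match regardless of case.
$caseless = preg_match('/hola/i', 'HOLA');

// m: ^ and $ also match at line boundaries inside the subject.
$lines = preg_match_all('/^\w+$/m', "one\ntwo");

// s: the dot matches a newline only when the modifier is set.
$withoutS = preg_match('/a.b/', "a\nb");
$withS    = preg_match('/a.b/s', "a\nb");

// U: quantifiers become ungreedy by default.
preg_match('/<.+>/', '<a><b>', $greedy);
preg_match('/<.+>/U', '<a><b>', $lazy);

// u: an invalid UTF-8 subject matches nothing; the reason is only
// visible through preg_last_error().
$bad = preg_match('/./u', "\xc3\x28");
$isUtf8Error = preg_last_error() === PREG_BAD_UTF8_ERROR;

var_dump($caseless, $lines, $withoutS, $withS, $greedy[0], $lazy[0], $bad, $isUtf8Error);
```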

User Contributed Notes 12 notes

Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of;

1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above: "UTF-8 validity of the pattern is checked since PHP 4.3.5").

2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically results in a "quiet death" for the preg_* functions, where nothing is matched but without indication that the string is invalid UTF-8.

3. Older PCRE versions regarded five and six octet UTF-8 character sequences as valid (both in patterns and the subject string), although these are not supported in Unicode; as the description of the u modifier above states, they are now regarded as invalid (see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO", which can be found at http://www.tldp.org/ and other places).

4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences) head to: http://hsivonen.iki.fi/php-utf8/

The following script should give you an idea of what works and what doesn’t;

$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",

    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

// Use each example as the pattern: invalid UTF-8 triggers a warning.
echo "++Invalid UTF-8 in pattern\n";
foreach ($examples as $name => $str) {
    echo "$name\n";
    preg_match("/" . $str . "/u", 'Testing');
}

// Use each example as the subject of preg_match().
echo "++ preg_match() examples\n";
foreach ($examples as $name => $str) {

    preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
    echo "$name: ";

    if (count($ar) == 0) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched {$ar[0]}\n";
    }
}

// Count the characters that preg_match_all() sees in each example.
echo "++ preg_match_all() examples\n";
foreach ($examples as $name => $str) {
    preg_match_all('/./u', $str, $ar);
    echo "$name: ";

    $num_utf8_chars = count($ar[0]);
    if ($num_utf8_chars == 0) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched $num_utf8_chars character\n";
    }
}
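One way to avoid the "quiet death" described in point 2 is to check preg_last_error() after the call. The wrapper below is a sketch, not part of PHP: pregMatchUtf8 is a hypothetical name, and it assumes a PHP version in which preg_match() returns false and sets PREG_BAD_UTF8_ERROR for an invalid UTF-8 subject.

```php
<?php
/**
 * Hypothetical wrapper: runs preg_match() and throws instead of failing
 * silently when the subject is not valid UTF-8.
 */
function pregMatchUtf8(string $pattern, string $subject, &$matches = null): int
{
    $result = preg_match($pattern, $subject, $matches);

    if ($result === false) {
        $error = preg_last_error();
        if ($error === PREG_BAD_UTF8_ERROR) {
            throw new InvalidArgumentException('Subject is not valid UTF-8');
        }
        throw new RuntimeException('preg_match failed with error code ' . $error);
    }

    return $result;
}

// A valid two-octet sequence matches as one character.
var_dump(pregMatchUtf8('/./u', "\xc3\xb1", $m));

// An invalid two-octet sequence is reported instead of matching nothing.
try {
    pregMatchUtf8('/./u', "\xc3\x28");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Validating input once at the boundary, for example with mb_check_encoding($str, 'UTF-8'), is an alternative when the same string is matched many times.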

Source