(.*)

Содержание

file_get_contents
Parameters
Return Values
Errors/Exceptions
Changelog
Examples
Notes
See Also
User Contributed Notes 6 notes
Как получить контент сайта на PHP
Получение контента с помощью библиотеки SimpleHTMLDOM
Получение контента с помощью cURL
Продвинутый скрипт получения контента на PHP

file_get_contents

This function is similar to file() , except that file_get_contents() returns the file in a string , starting at the specified offset up to length bytes. On failure, file_get_contents() will return false .

file_get_contents() is the preferred way to read the contents of a file into a string. It will use memory mapping techniques if supported by your OS to enhance performance.

Note:

If you’re opening a URI with special characters, such as spaces, you need to encode the URI with urlencode() .

Parameters

Note:

The FILE_USE_INCLUDE_PATH constant can be used to trigger include path search. This is not possible if strict typing is enabled, since FILE_USE_INCLUDE_PATH is an int . Use true instead.

A valid context resource created with stream_context_create() . If you don’t need to use a custom context, you can skip this parameter by null .

The offset where the reading starts on the original stream. Negative offsets count from the end of the stream.

Seeking ( offset ) is not supported with remote files. Attempting to seek on non-local files may work with small offsets, but this is unpredictable because it works on the buffered stream.

Maximum length of data read. The default is to read until end of file is reached. Note that this parameter is applied to the stream processed by the filters.

Return Values

The function returns the read data or false on failure.

This function may return Boolean false , but may also return a non-Boolean value which evaluates to false . Please read the section on Booleans for more information. Use the === operator for testing the return value of this function.

Errors/Exceptions

An E_WARNING level error is generated if filename cannot be found, length is less than zero, or if seeking to the specified offset in the stream fails.

When file_get_contents() is called on a directory, an E_WARNING level error is generated on Windows, and as of PHP 7.4 on other operating systems as well.

Changelog

Version	Description
8.0.0	length is nullable now.
7.1.0	Support for negative offset s has been added.

Examples

Example #1 Get and output the source of the homepage of a website

Example #2 Searching within the include_path

// If strict types are enabled i.e. declare(strict_types=1);
$file = file_get_contents ( ‘./people.txt’ , true );
// Otherwise
$file = file_get_contents ( ‘./people.txt’ , FILE_USE_INCLUDE_PATH );
?>

Example #3 Reading a section of a file

// Read 14 characters starting from the 21st character
$section = file_get_contents ( ‘./people.txt’ , FALSE , NULL , 20 , 14 );
var_dump ( $section );
?>

The above example will output something similar to:

Example #4 Using stream contexts

// Create a stream
$opts = array(
‘http’ =>array(
‘method’ => «GET» ,
‘header’ => «Accept-language: en\r\n» .
«Cookie: foo=bar\r\n»
)
);

$context = stream_context_create ( $opts );

// Open the file using the HTTP headers set above
$file = file_get_contents ( ‘http://www.example.com/’ , false , $context );
?>

Notes

Note: This function is binary-safe.

A URL can be used as a filename with this function if the fopen wrappers have been enabled. See fopen() for more details on how to specify the filename. See the Supported Protocols and Wrappers for links to information about what abilities the various wrappers have, notes on their usage, and information on any predefined variables they may provide.

When using SSL, Microsoft IIS will violate the protocol by closing the connection without sending a close_notify indicator. PHP will report this as «SSL: Fatal Protocol Error» when you reach the end of the data. To work around this, the value of error_reporting should be lowered to a level that does not include warnings. PHP can detect buggy IIS server software when you open the stream using the https:// wrapper and will suppress the warning. When using fsockopen() to create an ssl:// socket, the developer is responsible for detecting and suppressing this warning.

User Contributed Notes 6 notes

file_get_contents can do a POST, create a context for that first:

$opts = array( ‘http’ =>
array(
‘method’ => ‘POST’ ,
‘header’ => «Content-Type: text/xml\r\n» .
«Authorization: Basic » . base64_encode ( » $https_user : $https_password » ). «\r\n» ,
‘content’ => $body ,
‘timeout’ => 60
)
);

$context = stream_context_create ( $opts );
$url = ‘https://’ . $https_server ;
$result = file_get_contents ( $url , false , $context , — 1 , 40000 );

Note that if an HTTP request fails but still has a response body, the result is still false, Not the response body which may have more details on why the request failed.

There’s barely a mention on this page but the $http_response_header will be populated with the HTTP headers if your file was a link. For example if you’re expecting an image you can do this:

$mimetype = null ;
foreach ( $http_response_header as $v ) if ( preg_match ( ‘/^content\-type:\s*(image\/[^;\s\n\r]+)/i’ , $v , $m )) $mimetype = $m [ 1 ];
>
>

if (! $mimetype ) // not an image
>

if the connection is
content-encoding: gzip
and you need to manually ungzip it, this is apparently the key
$c=gzinflate( substr($c,10,-8) );
(stolen from the net)

//从指定位置获取指定长度的文件内容
function file_start_length($path,$start=0,$length=null) if(!file_exists($path)) return false;
$size=filesize($path);
if($start <0) $start+=$size;
if($length===null) $length=$size-$start;
return file_get_contents($path, false, null, $start, $length );
>

I’m not sure why @jlh was downvoted, but I verified what he reported.

>>> file_get_contents($path false, null, 5, null)
=> «»
>>> file_get_contents($path, false, null, 5, 5)
=> «r/bin»

Источник

Как получить контент сайта на PHP

Парсер контента на языке PHP – это важный инструмент для веб-разработчиков, которые работают с различными источниками данных. Он позволяет извлекать нужную информацию из HTML-страниц, XML-файлов и других форматов, а также обрабатывать ее в соответствии с заданными правилами.

Одним из основных преимуществ парсера контента является возможность автоматизировать процесс получения и обработки данных, что позволяет сократить время выполнения задач и уменьшить вероятность ошибок.

Для получения контента определённой страницы сайта есть простое решение с помощью собственной функции php — file_get_contents . Всё, что требуется это передать в функцию URL нужной страницы.

Получение контента с помощью библиотеки SimpleHTMLDOM

Для более качественной работы функции лучше воспользоваться подключаемой библиотекой SimpleHTMLDOM. В simplehtmldom есть методы для удаленной загрузки страниц. После подключения файла библиотеки, нам доступны 2 функции для обработки HTML строк:

str_get_html(str) и file_get_html(url)

Они делают одно и тоже, преобразуют HTML текст в DOM дерево, различаются лишь источники.

str_get_htm – на вход получает обычную строку, т.е. если вы получили HTML прибегнув к curl, или file_get_contents то вы просто передаете полученный текст этой функции.

$html = str_get_html(‘html код’);

file_get_html – сама умеет загружать данные с удаленного URL или из локального файла

К сожалению, file_get_html загружает страницы обычным file_get_contents . Это значит если хостер, выставил в php.ini allow_url_fopen = false (т.е. запретил удаленно открывать файлы), то загрузить что-то удаленно, не получится. Да и серьезные веб сайты таким способом парсить не стоит, лучше использовать CURL с поддержкой proxy и ssl .

$html = file_get_html(‘http://www.yandex.ru/’);
в результате, в переменной $html будет объект типа simple_html_dom.

При больших объемах данных, в библиотеке происходит утечка памяти. Поэтому после окончания одного цикла надо ее чистить.

Делает это метод clear .

К примеру грузим 5 раз сайт www.yandex.ru с разными поисковыми запросами

include 'simple_html_dom.php'; $k = 5; while($k>0)< $html = file_get_html('http://yandex.ru/yandsearch?text=hi'.$k.'&lr=11114'); // загружаем данные // как-то их обрабатываем $html->clear(); // подчищаем за собой unset($html); $k--; >

Ниже приведен ещё один пример использования библиотеки Simple HTML DOM Parser для парсинга HTML-страницы и извлечения заголовков новостей:

// Подключаем библиотеку require_once('simple_html_dom.php'); // Получаем содержимое страницы $html = file_get_html('http://example.com/news.html'); // Ищем все заголовки новостей foreach($html->find('h2.news-title') as $title) < // Выводим текст заголовка echo $title->plaintext; >

В этом примере мы используем библиотеку Simple HTML DOM Parser , которая предоставляет простой и удобный API для работы с HTML-документами. Сначала мы получаем содержимое страницы с помощью функции file_get_html() , затем находим все элементы с тегом h2 и классом news-title с помощью метода find() . Наконец, мы выводим текст каждого заголовка с помощью свойства plaintext .

Получение контента с помощью cURL

Неоспоримыми преимуществами в функционале пользуется библиотека или можно сказать модуль PHP — cURL . Для полноценного контролируемого получения контента здесь есть множество разных доплнений. Это и практически полноценный эмулятор браузерного обращения к сайту, работа скрипта через proxy с приватной идентификацией и многое другое. Ниже показана функция получения контента с помощью cURL .

function curl($url, $postdata='', $cookie='', $proxy='') < $uagent = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_7; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.205 Safari/534.16"; $ch = curl_init( $url ); curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // возвращает веб-страницу curl_setopt($ch, CURLOPT_HEADER, 0); // возвращает заголовки @curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // переходит по редиректам curl_setopt($ch, CURLOPT_ENCODING, ""); // обрабатывает все кодировки curl_setopt($ch, CURLOPT_USERAGENT, $uagent); // useragent curl_setopt($ch, CURLOPT_TIMEOUT, 10); // таймаут ответа //curl_setopt($ch, CURLOPT_MAXREDIRS, 10); // останавливаться после 10-ого редиректа if($proxy !='')if($mass_proxy[2]!="" and $mass_proxy[3]!="")< curl_setopt($ch, CURLOPT_PROXYUSERPWD,$mass_proxy[2].':'.$mass_proxy[3]);>// если необходимо предоставить имя пользователя и пароль 'user:pass' if(!empty($postdata)) < curl_setopt($ch, CURLOPT_POST, 1); curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata); >if(!empty($cookie)) < //curl_setopt($ch, CURLOPT_COOKIEJAR, $_SERVER['DOCUMENT_ROOT'].'/2.txt'); //curl_setopt($ch, CURLOPT_COOKIEFILE,$_SERVER['DOCUMENT_ROOT'].'/2.txt'); >$content = curl_exec( $ch ); $err = curl_errno( $ch ); $errmsg = curl_error( $ch ); $header = curl_getinfo( $ch ); curl_close( $ch ); $header['errno'] = $err; $header['errmsg'] = $errmsg; $header['content'] = $content; return $header['content']; >

Продвинутый скрипт получения контента на PHP

Ещё один способ получения контента встроенными функциями php я нашёл на просторах интернета. Пока не использовал, но привожу здесь для полноты картины, так как это решение тоже заслуживает достойного внимания. Далее, как в ©Источнике:
Полезными качествами, в данном контексте, будут возможность получения множества атрибутов запрашиваемого контента, а также возможность получения заголовка ответа сервера и времени выполнения запроса. Данная функция использует встроенные в PHP функции для работы с сокетами, которые предназначены для соединения клиента с сервером.

; $URL_SCHEME = ( isset($URL_PARTS['scheme']))?$URL_PARTS['scheme']:'http'; $URL_HOST = ( isset($URL_PARTS['host']))?$URL_PARTS['host']:''; $URL_PATH = ( isset($URL_PARTS['path']))?$URL_PARTS['path']:'/'; $URL_PORT = ( isset($URL_PARTS['port']))?intval($URL_PARTS['port']):80; if( isset($URL_PARTS['query']) && $URL_PARTS['query']!='' ) < $URL_PATH .= '?'.$URL_PARTS['query']; >; $URL_PORT_REQUEST = ( $URL_PORT == 80 )?'':":$URL_PORT"; //--- build GET request --- $USER_AGENT = ( $user_agent == null )?'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)':strval($user_agent); $GET_REQUEST = "GET $URL_PATH HTTP/1.0 " ."Host: $URL_HOST$URL_PORT_REQUEST " ."Accept: text/plain " ."Accept-Encoding: identity " ."User-Agent: $USER_AGENT "; //--- open socket --- $SOCKET_TIME_OUT = 30; $SOCKET = @fsockopen($URL_HOST, $URL_PORT, $ERROR_NO, $ERROR_STR, $SOCKET_TIME_OUT); if( $SOCKET ) < if( fputs($SOCKET, $GET_REQUEST)) < socket_set_timeout($SOCKET, $SOCKET_TIME_OUT); //--- read header --- $header = ''; $SOCKET_STATUS = socket_get_status($SOCKET); while( !feof($SOCKET) && !$SOCKET_STATUS['timed_out'] ) < $temp = fgets($SOCKET, 128); if( trim($temp) == '' ) break; $header .= $temp; $SOCKET_STATUS = socket_get_status($SOCKET); >; //--- get server code --- if( preg_match('~HTTP/(d+.d+)s+(d+)s+(.*)s* ~si', $header, $res)) $SERVER_CODE = $res[2]; else break; if( $SERVER_CODE == ABI_URL_STATUS_OK ) < //--- read content --- $content = ''; $SOCKET_STATUS = socket_get_status($SOCKET); while( !feof($SOCKET) && !$SOCKET_STATUS['timed_out'] ) < $content .= fgets($SOCKET, 1024*8); $SOCKET_STATUS = socket_get_status($SOCKET); >; //--- time results --- $TIME_END = explode(' ', microtime()); $TIME_TOTAL = ($TIME_END[0]+$TIME_END[1])-($TIME_START[0]+$TIME_START[1]); //--- output --- $URL_RESULT['header'] = $header; $URL_RESULT['content'] = $content; $URL_RESULT['time'] = $TIME_TOTAL; $URL_RESULT['description'] = ''; $URL_RESULT['keywords'] = ''; //--- title --- $URL_RESULT['title'] =( preg_match('~~U', $content, $res))?strval($res[1]):''; //--- meta tags --- if( preg_match_all('~]+>~', $content, $res, PREG_SET_ORDER) > 0 ) < foreach($res as $meta) $URL_RESULT[strtolower($meta[1])] = $meta[2]; >; > elseif( $SERVER_CODE == ABI_URL_STATUS_REDIRECT_301 || $SERVER_CODE == ABI_URL_STATUS_REDIRECT_302 ) < if( preg_match('~location:s*(.*?) ~si', $header, $res)) < $REDIRECT_URL = rtrim($res[1]); $URL_PARTS = @parse_url($REDIRECT_URL); if( isset($URL_PARTS['scheme'])&& isset($URL_PARTS['host'])) $url = $REDIRECT_URL; else $url = $URL_SCHEME.'://'.$URL_HOST.'/'.ltrim($REDIRECT_URL, '/'); >else < break; >; >; >;// GET request is OK fclose($SOCKET); >// socket open is OK else < break; >; $TRY_ID++; > while( $TRY_ID ; ?>

Итак, входящими параметрами являются: $url — строка, содержащая URL http-протокола, $user_agent — строка с любым юзер-агентом (если пропустить параметр или установить его в null — user_agent будет как в IE). Константа MAX_REDIRECTS_NUM устанавливает количество разрешенных редиректов (поддерживаются 301 и 302 редиректы).

Теперь перейдем к примерам практического использования этой функции:

 else print 'Запрашиваемая страница недоступна.'; ?>

Как видно из вышеприведенного примера, мы можем получить всю информацию по запрошенному URL . Кроме того, можно получить значения любого мета-тега. Для этого можно воспользоваться следующим кодом:

Парсер контента на языке PHP – это важный инструмент для получения и обработки данных из различных источников. Благодаря мощным библиотекам и инструментам, разработчики могут легко и удобно извлекать нужную информацию из HTML-страниц, XML-файлов и других форматов.

Дата публикации: 2019-04-26