Php simple html dom plaintext

Contents

Quick Start
Read plain text from HTML document
Read plain text from HTML string
Read specific elements from HTML document
Modify HTML documents
Collect information from Slashdot
PHP HTML DOM parser with jQuery-like selectors
Parsing documents
DOM methods & properties
Element methods & properties
DOM traversing
Camel naming conventions

PHP Simple HTML DOM Parser

simplehtmldom is a fast and reliable HTML DOM parser for PHP.

Purely PHP-based DOM parser (no XML extensions required).
Works with well-formed and broken HTML documents.
Loads webpages, local files and document strings.
Supports CSS selectors.

simplehtmldom requires PHP 5.6 or higher with ext-iconv enabled. The following extensions enable additional features of the parser:

ext-mbstring (recommended): enables better detection for multi-byte documents.
ext-curl: enables cURL support for the class HtmlWeb.
ext-openssl (recommended when using cURL): enables SSL support for cURL.

Manual installation: download the latest release from SourceForge and extract the files into the vendor folder of your project.

Installation with Composer:

composer require simplehtmldom/simplehtmldom

Installation with Git:

git clone git://git.code.sf.net/p/simplehtmldom/repository simplehtmldom

Note: The GitHub repository serves as a mirror for the SourceForge project. We currently accept pull requests and issues only via SourceForge.

This example illustrates how to return the page title:

// Setup lines are an assumption: only the load() call and the echo are given.
// HtmlWeb is the class named in the requirements; the include path may differ.
include_once 'HtmlWeb.php';
use simplehtmldom\HtmlWeb;

$client = new HtmlWeb();
$html = $client->load('https://www.google.com/search?q=simplehtmldom');

// Returns the page title
echo $html->find('title', 0)->plaintext . PHP_EOL;

Find more examples in the installation folder under examples.

The documentation for this library is hosted at https://simplehtmldom.sourceforge.io/docs/

There are various ways for you to get involved with simplehtmldom. Here are a few:

Share this project with your friends (Twitter, Facebook, ... you name it).
Report bugs (SourceForge).
Request features (SourceForge).
Discuss existing bugs, features and ideas.

If you want to contribute code to the project, please open a feature request and include your patch with the message.

The source code for simplehtmldom is licensed under the MIT license. For further information read the LICENSE file in the root directory (should be located next to this README file).

simplehtmldom is a purely PHP-based DOM parser that doesn’t rely on external libraries like libxml, SimpleXML or PHP DOM. Doing so provides better control over the parsing algorithm and a much simpler API that even novice users can learn to use in a short amount of time.

About

This is a mirror of the Simple HTML DOM Parser at


Quick Start

The sample code below demonstrates the fundamental features of PHP Simple HTML DOM Parser.

Read plain text from HTML document

echo file_get_html('https://www.google.com/')->plaintext;

Loads the specified HTML document into memory, parses it and returns the plain text. Note that file_get_html supports local files as well as remote files!

Read plain text from HTML string

Parses the provided HTML string and returns the plain text. Note that the parser handles partial documents as well as full documents.
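A minimal sketch of parsing a string, assuming the single-file distribution (simple_html_dom.php) is on the include path. The HTML fragment and variable names are illustrative values, not part of the library.

```php
<?php
// Assumption: simple_html_dom.php (single-file distribution) is on the include path.
require_once 'simple_html_dom.php';

// Illustrative partial document: a bare list without <html> or <body>.
$fragment = '<ul><li>Hello, World!</li></ul>';

// str_get_html() builds a DOM object from the string.
$html = str_get_html($fragment);

// plaintext drops the markup and keeps only the text content.
$text = $html->plaintext;
echo $text . PHP_EOL;
```

Because the fragment has no surrounding html or body element, this also exercises the partial-document case mentioned above.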

Read specific elements from HTML document

$html = file_get_html('https://www.google.com/');

foreach($html->find('img') as $element)
    echo $element->src . '<br>';

foreach($html->find('a') as $element)
    echo $element->href . '<br>';

Loads the specified document into memory and returns a list of image sources as well as anchor links. Note that find supports CSS selectors to find elements in the DOM.

Modify HTML documents

$doc = '<div id="hello">Hello</div><div id="world">World</div>';

$html = str_get_html($doc);

$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // <div id="hello">foo</div><div id="world" class="bar">World</div>

Parses the provided HTML string and replaces elements in the DOM before returning the updated HTML string. In this example, the class of the second div element is set to bar and the inner text of the first div element is set to foo.

Note that find supports a second parameter to return a single element from the array of matches.

Note that attributes can be accessed directly by means of magic methods (->class and ->innertext in the example above).
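The two notes above can be sketched as follows. This assumes simple_html_dom.php is on the include path; the markup and variable names are illustrative. The null check reflects the behaviour of the releases this sketch was written against, where find with an index returns null when nothing matches.

```php
<?php
// Assumption: simple_html_dom.php is on the include path.
require_once 'simple_html_dom.php';

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// Without an index, find() returns an array of matching elements.
$all = $html->find('div');

// With a zero-based index, find() returns a single element.
$second = $html->find('div', 1);

// An index without a match yields null, so check before dereferencing.
$missing = $html->find('span', 0);
if ($missing === null) {
    echo 'No span element in this document' . PHP_EOL;
}

// Attributes are read and written like ordinary properties.
echo $second->id . PHP_EOL;
$second->class = 'bar';
echo $second->class . PHP_EOL;
```

Checking for null matters most when scraping live pages, where the markup can change without notice.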

Collect information from Slashdot

$html = file_get_html('https://slashdot.org/');

$articles = $html->find('article[data-fhtype="story"]');

foreach($articles as $article) {
    $item['title'] = $article->find('.story-title', 0)->plaintext;
    $item['intro'] = $article->find('.p', 0)->plaintext;
    $item['details'] = $article->find('.details', 0)->plaintext;
    $items[] = $item;
}

print_r($items);

Collects information from Slashdot for further processing.

Note that the combination of CSS selectors and magic methods makes parsing HTML documents a simple task that is easy to understand.


PHP HTML DOM parser with jQuery-like selectors

Good day, dear Habr readers. This post covers a joint project by S. C. Chen and John Schlick called PHP Simple HTML DOM Parser (hosted on SourceForge).

The idea of the project is to provide a tool for working with HTML code through jQuery-like selectors. The original idea belongs to Jose Solorzano and was implemented for PHP 4. This project is an improved version based on PHP 5+.

This overview presents short excerpts from the official manual, followed by an example parser for Twitter. In fairness, a similar post already exists on Habrahabr, but in my opinion it contains too little information. If the topic interests you, read on.

Getting the HTML code of a page

$html = file_get_html('http://habrahabr.ru/'); // also works with https://

Fedcomp left a useful comment about file_get_contents and 404 responses: the original script returns nothing when the requested page responds with 404. To fix this, I added a check based on get_headers. The revised script is available from the original post.
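A minimal sketch of such a get_headers check. It is not the author's revised script; both helper names (last_status_code, url_is_reachable) are hypothetical and exist only for this illustration. The status parsing is kept in its own function so that it can be exercised without network access.

```php
<?php
/**
 * Hypothetical helper: returns the last HTTP status code found in the
 * header lines produced by get_headers(), or 0 if none is present.
 * Redirects add several status lines; the last one describes the final response.
 */
function last_status_code(array $headers)
{
    $code = 0;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

/**
 * Hypothetical helper: true when the URL answers with a status below 400.
 */
function url_is_reachable($url)
{
    $headers = @get_headers($url); // false when the request fails
    if ($headers === false) {
        return false;
    }
    $code = last_status_code($headers);
    return $code > 0 && $code < 400;
}

// Possible use with the parser (requires simple_html_dom.php to be loaded):
// $html = url_is_reachable($url) ? file_get_html($url) : null;
```

Note that this issues one extra request per URL, which may be undesirable when fetching many pages.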

Finding elements by tag name

foreach($html->find('img') as $element) { // select every img tag on the page
    echo $element->src . '<br>'; // print each src attribute on its own line
}

Modifying HTML elements

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>'); // read HTML from a string (file_get_html() reads from a file)

$html->find('div', 1)->class = 'bar'; // give the div at index 1 the class "bar"

$html->find('div[id=hello]', 0)->innertext = 'foo'; // set the content of the div with id="hello" to foo

echo $html; // prints <div id="hello">foo</div><div id="world" class="bar">World</div>

Getting the text content of an element (plaintext)

echo file_get_html('http://habrahabr.ru/')->plaintext;

This article does not aim to provide exhaustive documentation for the script; a detailed description of every feature is available in the official manual. If the community is interested, I will gladly translate the whole manual into Russian. For now, here is the Twitter parser example promised at the start of the article.

Example: parsing messages from Twitter

require_once 'simple_html_dom.php'; // parsing library

$username = 'habrahabr'; // Twitter user name
$maxpost = 5; // number of posts
$articles = array(); // collected messages

$html = file_get_html('https://twitter.com/' . $username);

$i = 0;
foreach ($html->find('li.expanding-stream-item') as $article) { // select every message li
    $item['text'] = $article->find('p.js-tweet-text', 0)->innertext; // message text as HTML
    $item['time'] = $article->find('small.time', 0)->innertext; // message time as HTML
    $articles[] = $item; // append to the array
    $i++;
    if ($i == $maxpost) break; // stop after $maxpost messages
}

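Printing the messages

// The wrapper tags are an assumption: the original markup inside the strings was stripped.
for ($j = 0; $j < count($articles); $j++) { // never read past the collected messages
    echo '<div>';
    echo '<p>' . $articles[$j]['text'] . '</p>';
    echo '<p>' . $articles[$j]['time'] . '</p>';
    echo '</div>';
}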

Thank you for your attention. I hope this was not too heavy and was easy to follow.

Parsing documents

The parser accepts documents in the form of URLs, files and strings. The document must be accessible for reading and cannot exceed MAX_FILE_SIZE .

Name	Description
str_get_html( string $content ) : object	Creates a DOM object from string.
file_get_html( string $filename ) : object	Creates a DOM object from file or URL.
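A short sketch of both entry points and the size limit. It assumes simple_html_dom.php is on the include path, that MAX_FILE_SIZE is defined by that file, and that the helper functions return false for rejected input; the temporary file and its contents are illustrative.

```php
<?php
// Assumption: simple_html_dom.php is on the include path and defines MAX_FILE_SIZE.
require_once 'simple_html_dom.php';

// A string longer than MAX_FILE_SIZE is rejected instead of parsed.
$oversized = str_repeat('x', MAX_FILE_SIZE + 1);
$rejected = str_get_html($oversized);
var_dump($rejected === false);

// A local file goes through the same function as a URL.
$path = tempnam(sys_get_temp_dir(), 'shd'); // illustrative temporary file
file_put_contents($path, '<div>From a file</div>');

$fromFile = file_get_html($path);
if ($fromFile !== false) {
    echo $fromFile->plaintext . PHP_EOL;
}

unlink($path);
```

Checking the return value before calling find or reading plaintext avoids a fatal error on empty or oversized documents.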

DOM methods & properties

Name	Description
__construct( [string $filename] ) : void	Constructor. If the filename parameter is set, the contents are loaded automatically, from either text or a file/URL.
plaintext : string	Returns the contents extracted from HTML.
clear() : void	Clean up memory.
load( string $content ) : void	Load contents from string.
save( [string $filename] ) : string	Dumps the internal DOM tree back into a string. If $filename is set, the resulting string is also saved to that file.
load_file( string $filename ) : void	Load contents from a file or a URL.
set_callback( string $function_name ) : void	Set a callback function.
find( string $selector [, int $index] ) : mixed	Finds elements by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
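The methods in this table can be combined into an object-style workflow. A minimal sketch, assuming simple_html_dom.php is on the include path and provides the simple_html_dom class; the markup is illustrative.

```php
<?php
// Assumption: simple_html_dom.php is on the include path.
require_once 'simple_html_dom.php';

// Create an empty DOM object, then load a document from a string.
$dom = new simple_html_dom();
$dom->load('<html><body><p>Draft</p></body></html>');

// Change the first paragraph.
$dom->find('p', 0)->innertext = 'Final';

// save() without a filename returns the serialized document.
$output = $dom->save();
echo $output . PHP_EOL;

// Release the parser's internal references once the document is no longer needed.
$dom->clear();
unset($dom);
```

Calling clear() is mainly useful in long-running scripts that parse many documents in a loop.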

Element methods & properties

Name	Description
[attribute] : string	Read or write element’s attribute value.
tag : string	Read or write the tag name of element.
outertext : string	Read or write the outer HTML text of element.
innertext : string	Read or write the inner HTML text of element.
plaintext : string	Read or write the plain text of element.
find( string $selector [, int $index] ) : mixed	Finds children by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
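The difference between the text properties is easiest to see side by side. A minimal sketch, assuming simple_html_dom.php is on the include path; the markup is illustrative.

```php
<?php
// Assumption: simple_html_dom.php is on the include path.
require_once 'simple_html_dom.php';

$html = str_get_html('<div id="box"><b>Bold</b> text</div>');
$box = $html->find('#box', 0);

echo $box->tag . PHP_EOL;       // tag name of the element
echo $box->id . PHP_EOL;        // attribute read through [attribute]
echo $box->innertext . PHP_EOL; // HTML between the opening and closing tag
echo $box->outertext . PHP_EOL; // HTML including the element's own tags
echo $box->plaintext . PHP_EOL; // text with the markup removed
```

Since these properties are also writable, assigning to innertext or outertext replaces the corresponding part of the document when it is printed or saved.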

DOM traversing

Name	Description
$e->children( [int $index] ) : mixed	Returns the Nth child object if index is set, otherwise returns an array of children.
$e->parent() : element	Returns the parent of element.
$e->first_child() : element	Returns the first child of element, or null if not found.
$e->last_child() : element	Returns the last child of element, or null if not found.
$e->next_sibling() : element	Returns the next sibling of element, or null if not found.
$e->prev_sibling() : element	Returns the previous sibling of element, or null if not found.
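A short sketch of walking a small list with the traversal methods, assuming simple_html_dom.php is on the include path; the markup is illustrative.

```php
<?php
// Assumption: simple_html_dom.php is on the include path.
require_once 'simple_html_dom.php';

$html = str_get_html('<ul id="menu"><li>One</li><li>Two</li><li>Three</li></ul>');
$list = $html->find('ul', 0);

$first = $list->first_child();
$second = $first->next_sibling();
$last = $list->last_child();

echo trim($first->plaintext) . PHP_EOL;
echo trim($second->plaintext) . PHP_EOL;
echo trim($last->plaintext) . PHP_EOL;

// children() without an index returns every child element.
echo count($list->children()) . PHP_EOL;

// Moving past the last sibling yields null rather than an element.
var_dump($last->next_sibling() === null);

// parent() leads back to the enclosing element.
echo $second->parent()->id . PHP_EOL;
```

As with find, the null results at the edges of the tree should be checked before chaining further calls.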

Camel naming conventions

Method	Mapping
$e->getAllAttributes()	$e->attr
$e->getAttribute( $name )	$e->attribute
$e->setAttribute( $name, $value )	$e->attribute = $value
$e->hasAttribute( $name )	isset($e->attribute)
$e->removeAttribute( $name )	$e->attribute = null
$e->getElementById( $id )	$e->find( "#$id", 0 )
$e->getElementsById( $id [, $index] )	$e->find( "#$id" [, int $index] )
$e->getElementByTagName( $name )	$e->find( $name, 0 )
$e->getElementsByTagName( $name [, $index] )	$e->find( $name [, int $index] )
$e->parentNode()	$e->parent()
$e->childNodes( [$index] )	$e->children( [int $index] )
$e->firstChild()	$e->first_child()
$e->lastChild()	$e->last_child()
$e->nextSibling()	$e->next_sibling()
$e->previousSibling()	$e->prev_sibling()
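A minimal sketch showing a few camel-case methods next to their mapped forms, assuming simple_html_dom.php is on the include path; the markup is illustrative.

```php
<?php
// Assumption: simple_html_dom.php is on the include path.
require_once 'simple_html_dom.php';

$html = str_get_html('<div id="nav"><a id="home" href="/index.html" title="Home">Home</a></div>');
$nav = $html->find('div', 0);

// getElementById( $id ) corresponds to find( "#$id", 0 ).
$link = $nav->getElementById('home');

// getAttribute / setAttribute correspond to reading and assigning the property.
echo $link->getAttribute('href') . PHP_EOL;
$link->setAttribute('href', '/start.html');
echo $link->href . PHP_EOL;

// hasAttribute corresponds to isset(); removeAttribute assigns null.
$hadTitle = $link->hasAttribute('title');
$link->removeAttribute('title');
echo $link->outertext . PHP_EOL; // the title attribute is no longer rendered

// parentNode() corresponds to parent().
echo $link->parentNode()->id . PHP_EOL;
```

Either style can be used; mixing them in one code base mostly affects readability.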

