How do I get HTML text from a web page without having it rendered?
When you say you just want the text, how do you mean — of a particular element? Or all text, with the tags removed?
Minitech, it is rendered. I run a PHP script via Ajax and then echo what should be HTML source code back into my regular page from the script. Instead of HTML- or XML-style source code, it gives me the pictures, CSS formatting, everything. I just want the view-source code.
Also, I did something really stupid. I output the response using foo.HTML(data I received back), which obviously turned everything into HTML, when I should have been using foo.TEXT(data I received). Hours wasted!!
4 Answers
I’ll assume you’re just doing echo $html and assuming it’s rendered somehow. It’s not. Look at it in plain text instead:
header("Content-Type: text/plain"); echo $html;
And if by «rendered» you mean «ASP.NET rendered the page into HTML», no, you can’t get the source of arbitrary remote pages. That would be a pretty big security risk.
@Leon: Mmm. doesn’t for me. Are you getting any warnings? Is there any content before your opening
@Leon: Ah, it’s using Ajax. See, details like that are important. You can keep the PHP exactly as it is, but you need to do the actual «escaping» in JavaScript. Please post your Ajax request.
@Leon: That’s the bad way to do it, which is why I asked. You should do the escaping client-side, or you have to transfer about 2.5x as much data with less extensibility.
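As a rough illustration of the client-side escaping suggested in the comments above, here is a minimal sketch in plain JavaScript. The helper name `escapeHtml` and the sample `response` string are assumptions made up for this example; they are not from the original thread. The idea is that the PHP script keeps sending raw HTML, and the page escapes it just before display.

```javascript
// Hypothetical helper (not from the thread): escape the characters that
// HTML treats as markup, similar in spirit to PHP's htmlspecialchars.
function escapeHtml(source) {
  const entities = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#039;'
  };
  // Replace each special character with its entity in a single pass.
  return String(source).replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Stand-in for the raw HTML an Ajax request might return.
const response = '<p class="intro">Tom & Jerry</p>';
console.log(escapeHtml(response));
```

In a browser you can often skip the manual escaping entirely: assigning the response to an element's `textContent` (or calling jQuery's `.text()` rather than `.html()`) makes the browser display the markup literally instead of rendering it.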
Erm. all file_get_contents does is get the contents of the file. It only looks like «rendered» HTML because you’re dumping it right into the output stream. Try running it through htmlspecialchars before outputting it.
Look at the comment I gave to minitech. How do I just get the XML-style tags you see when you right-click and view source?
I agree with @Kolink. Something like this will work:
$html = htmlspecialchars(file_get_contents('http://stackoverflow.com/questions/ask'));
Not tested, though pretty confident!
What do you mean by «just the text»? Do you want to scrape the text content of the HTML file? Then you should parse the file, i.e. pick out the tags that contain interesting content with a tool like simplehtmldom (look for the tab «extract content from html»). Or write your own parser and, if necessary, strip the remaining tags from the content with PHP’s own strip_tags.