Read Text on webpage

Hello everyone.

Is it possible to read the text displayed on a webpage?

Later I could filter what is important or not. Looking at the HTML (inspecting it in Google Chrome), I see that the text is spread across various files, so it is not easy to “compile” it.

Is there a simple way, such as giving a URL and assigning the text (as displayed on screen, even in simple plain-text format) to a string (…)?

Thanks in advance.

You can use the network.request() API call to fetch the contents of a URL. For a web page, you’re going to get the whole page source including the <head> and full body.  You would need to parse through all of the HTML to get to the text that you want. Depending on the structure of the web page, this could be easy or it could be hard. See: https://docs.coronalabs.com/api/library/network/request.html

HTML is closely related to XML, so if the page is well-formed you can use a Lua-based XML parser (we have one in the Business App sample - https://github.com/coronalabs-samples/business-app-sample) to make parsing the HTML page easier, and you get a Lua table out of it.
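If all you need is the visible text and the page is simple, a rough alternative to a full parser is to strip the tags with Lua string patterns. The following is only a minimal sketch: `htmlToText` is a hypothetical helper (not part of Corona), it assumes lowercase tag names, and it decodes only a handful of entities.

```lua
-- Hypothetical helper (not a Corona API): crude HTML-to-text conversion
-- using Lua string patterns. Assumes lowercase tags; not a real HTML parser.
local function htmlToText( html )
    local text = html

    -- Drop <script> and <style> blocks together with their contents.
    text = text:gsub( "<script.->.-</script>", " " )
    text = text:gsub( "<style.->.-</style>", " " )

    -- Replace every remaining tag with a space.
    text = text:gsub( "<[^>]*>", " " )

    -- Decode a few common entities (&amp; last, to avoid double decoding).
    text = text:gsub( "&nbsp;", " " )
    text = text:gsub( "&lt;", "<" )
    text = text:gsub( "&gt;", ">" )
    text = text:gsub( "&amp;", "&" )

    -- Collapse runs of whitespace and trim both ends.
    text = text:gsub( "%s+", " " )
    return ( text:match( "^%s*(.-)%s*$" ) )
end

local sample = "<html><head><title>T</title><style>p{color:red}</style></head>"
    .. "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
print( htmlToText( sample ) )
```

In an app you would call this on event.response inside your network listener. Pages with unusual markup will need a proper parser.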

Rob

Hello.

Another question: I used the sample code and it works fine, printing the contents to the console.

But I can't put it in a variable so that I can “decode” or “filter” it in another part of the program.

Why? What exactly is event.response?

If I put something like

text = event.response

the variable text is nil outside the function, even if I make it global…

Thanks in advance.

-- The following sample code contacts Google's encrypted search over SSL
-- and prints the response (in this case, the HTML source of the home page)
-- to the Corona terminal.

local function networkListener( event )

    if ( event.isError ) then
        print( "Network error: ", event.response )
    else
        print ( "RESPONSE: " .. event.response )
    end
end

-- Access Google over SSL:
network.request( "https://encrypted.google.com", "GET", networkListener )

Can you post your code using the <> code button in the toolbar row with Bold and Italic (copy and paste your code from your text editor)? You mention a variable named “text” above, but the code you posted doesn't contain it.

Thanks

Rob

Hello again.

I've changed the code since then, so what I posted was the Corona docs example. My code is below.

It is also strange that it executes the line with the print “_____” first and only later the print inside the function…

Sorry, it could be a very simple problem…

local function networkListener( event )
    if ( event.isError ) then
        print( "Network error: ", event.response )
    else
        --print ( "RESPONSE: " .. event.response )
        _G.text=event.response
        print("it print the html when print is here:",_G.text)
    end
end

-- Access Google over SSL:
network.request( "http://www.pnr.pt/agenda", "GET", networkListener )
print("____________________________________________________")
print("it print NIL when the print is here:",_G.text)

You are running into an issue because network.request() is an asynchronous call. That means network.request() returns immediately and your print statements execute while the request is still in progress. Fetching web content takes time, and we don't want to block your user interface while your app waits for the website to respond. A website could take tens of seconds to respond.

This is why you have a listener function. When that function is called, the network.request() is complete, and only then is event.response set. So in your example, _G.text is being set after you print it. Normally you would do the work inside the listener, or call some other function from it and pass it the event.response value to work on.
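As a rough sketch of that pattern, the listener below hands the response to a separate function once it has arrived. `processPage` and `lastPage` are hypothetical names used only for illustration, and the `if network then` guard is an assumption added so the snippet also loads outside the Corona runtime.

```lua
-- Hypothetical storage for the page; only valid after the listener has run.
local lastPage = nil

-- Hypothetical function that does the real work on the downloaded HTML.
local function processPage( html )
    lastPage = html
    print( "Received " .. #html .. " bytes of HTML" )
    -- Filter or decode the text here.
end

local function networkListener( event )
    if ( event.isError ) then
        print( "Network error: ", event.response )
    else
        -- event.response is only available once the request has completed,
        -- so pass it on from inside the listener.
        processPage( event.response )
    end
end

-- network.request() returns immediately; the listener is called later.
-- The guard is only there so this file also loads in plain Lua.
if network then
    network.request( "http://www.pnr.pt/agenda", "GET", networkListener )
end

-- Still nil at this point: the response has not arrived yet.
print( "Right after network.request():", lastPage )
```

Anything that depends on the page contents should start from processPage (or from the listener itself), not from the code that follows network.request().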

Rob
