[IE] Loading HTML content from any source

Post by Administrator » 17-Apr-2014, 22:10

Introduction
Some time ago I faced the task of dynamically loading an HTML page and parsing its content. I thought this "trivial problem" would take a couple of hours. In fact, it took a week to solve. Some of the solutions I found are ugly or buggy, wrong or incomplete, and some of them come from Microsoft technical articles and blogs.

I found the IHTMLDocument2 COM interface of MSHTML. MSHTML is the core of the IE browser that is built into every Windows system. The MSDN Library gives a brief description of its methods and properties. This COM interface can be used to parse HTML pages and find tags (<img>, <a>, <h1>, etc.) to extract images, links and headers.

The DHTML Object Model is used to access and manipulate the contents of an HTML page and is not available until the page is loaded.

The IPersistStreamInit interface, and its associated methods, can be used to load HTML content from a stream using the WebBrowser control and Microsoft Visual C++. The IWebBrowser2::Navigate2 method enables you to navigate the browser to a URL. Navigating to the empty page about:blank first ensures that MSHTML is loaded and that the HTML elements are available through the Dynamic HTML (DHTML) Object Model.

Code: Select all
m_pBrowser->Navigate2( _T("about:blank"), NULL, NULL, NULL, NULL ); 


I saw that the IHTMLDocument2 interface has a write() method and that the element interface, IHTMLElement, has a put_innerHTML() function, so I figured that this would be a trivial problem to solve. I wanted to create an HTML page from a memory buffer and display it to the user. I decided to use the Unicode character set, because I always try to support national languages in my applications. Besides, some solutions do not work properly with non-Unicode pages.

Loading HTML content from a Stream
The first article I found on MSDN was “Loading HTML content from a Stream”. This is the solution recommended by Microsoft. Unfortunately, it took a lot of time to make this sample work in my project. At first, no matter what I tried, I couldn't get it to work: my call to Load() would return success, but my data wouldn't appear in the document. This solution wouldn't have worked for me anyway, because IPersistStreamInit::Load() initiates a document reload, and a MIME bug in IE6 kills this approach.

I found other people with the same problem. It turns out that a few important details were omitted in the MSDN article. In some blogs people described how to make IPersistStreamInit::Load() work properly. Recent versions of MSHTML require a message loop. Apparently, older versions of MSHTML did not. Without a message loop, Load() returns success but the actual work of loading the HTML is performed asynchronously.

Code: Select all
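LPCTSTR pMyHtml = _T("<html><body>New HTML from memory!</body></html>");
pStream->Write( pMyHtml, lstrlen(pMyHtml)*sizeof(TCHAR), NULL );
// Rewind the stream so that Load() reads from the beginning.
LARGE_INTEGER liStart = { 0 };
pStream->Seek( liStart, STREAM_SEEK_SET, NULL );
pPersistStreamInit->Load( pStream ); // returns at once; loading continues asynchronously
// Pump messages until MSHTML reports that the document is complete.
CComBSTR bsStatus;
while( SUCCEEDED(pHtmlDoc->get_readyState(&bsStatus)) && (bsStatus != L"complete") )
{
    bsStatus.Empty(); // release the string before the next get_readyState() call
    MSG msg;
    if( ::PeekMessage( &msg, NULL, 0, 0, PM_NOREMOVE) )
    {
        AfxGetApp()->PumpMessage();
    }
}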
 

The application determines that a page is loaded by handling the DWebBrowserEvents2::DocumentComplete() event of the WebBrowser control. The IPersistStreamInit interface has InitNew() and Load() methods that are used to initialize and load an HTML document from a stream. The InitNew() method initializes the stream to a known state and the Load() method loads the HTML content from the stream.
Note: In Microsoft Internet Explorer 5, more than one call to the Load method of the IPersist interfaces is supported. In earlier versions, only one call to Load per instance of MSHTML is supported.

Unfortunately, after going through all this time and effort to make Load() work, I discovered a small fact that made all of this effort useless: there's a bug in the MIME sniffing code used by IPersistStreamInit::Load(). The bug is documented in the comments that follow the MSDN article:
IPersistStreamInit::Load() has a known bug described in BUG: PersistStreamInit::Load() Displays HTML Files as Text at http://support.microsoft.com/?id=323569. This bug makes this article's solution unsuitable for a production environment. Instead, use IPersistMoniker to reliably load a stream.
(Comment by MandatoryDefault, 10/18/2008)

APPLIES TO
Microsoft Internet Explorer (Programming) 6 (SP1)
Microsoft Internet Explorer (Programming) 5.5 SP2


I found Jim Beveridge's blog, where he analyzed the common solutions. It links to Philip Patrick’s article on CodeProject titled "Loading and parsing HTML using MSHTML. 3rd way." This was an easy solution to use because it worked the first time and required no message loop. Jim notes that the write() method requires the HTML to be passed as Unicode. The only challenge in using IHTMLDocument2::write() is properly setting up the SAFEARRAY; the sample code on MSDN shows how to do this. Although the article did not solve my exact problem, it did lead me to the solution. So I decided to use the IHTMLDocument2::write() method.
Code: Select all
CComBSTR bsData = L"<html><body>Dynamic Web-page.</body></html>";
SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
if( psa )
{
   VARIANT *param = NULL;
   if( SUCCEEDED(SafeArrayAccessData(psa, (LPVOID*)&param)) )
   {
      param->vt = VT_BSTR;
      param->bstrVal = bsData.Detach(); // the array now owns the string
      SafeArrayUnaccessData(psa);
      pHtmlDoc->write( psa );
      pHtmlDoc->close();
   }
   SafeArrayDestroy(psa); // also frees the BSTR stored in the array
}


My last attempt was to use the IHTMLDocument2::get_body() method.
Code: Select all
CComPtr<IHTMLElement> pBodyElement;
hr = pHtmlDoc->get_body( &pBodyElement );
if( SUCCEEDED( hr ) && pBodyElement )
{
   CComBSTR bsMainText; // freed automatically when it goes out of scope
   pBodyElement->get_innerHTML( &bsMainText );
   // … CHANGE HTML
   pBodyElement->put_innerHTML( bsMainText );
}