Using Internet Explorer's HTMLParser

By Andrew Tucker, August 01, 1999

Microsoft's Internet Explorer 4.0 browser provides COM interfaces that let you easily load and parse HTML without actually having to display it. Andrew describes these interfaces and implements a C++ class that lets you take advantage of them.

Andrew works on development tools for Windows CE at BSQUARE Corp. He can be reached at [email protected].

Like it or not, HTML is becoming a part of every programmer's life. Beyond the obvious web page uses, HTML is also a good format for distributing platform-independent documentation, technical papers, and reference manuals. Partnered with a scripting language and applets or Component Object Model (COM) objects, HTML provides an easy way to create interactive examples and demos that anyone with a browser can utilize.

Because it is a text-layout language rather than a programming language, however, parsing HTML has a slightly different flavor. Commands in the form of tags and arguments specify how something looks rather than how something is performed, and browsers and other tools simply ignore unrecognized tags and arguments. To this end, a browser usually parses HTML into an internal form that is then used for display.

Microsoft's Internet Explorer 4.0 (IE4) is a typical browser in this regard and provides COM interfaces that let you easily load and parse HTML without actually having to display it. In this article, I'll describe these interfaces and implement a C++ class, HTMLParser, which lets you take advantage of them. With a little help from the WinInet APIs, I'll then use HTMLParser to write a utility called CheckLinks that checks HTML pages for dead links. The complete source code and related files for CheckLinks are available electronically; see "Resource Center," page 5.

COM Interfaces

Although IE4 exposes approximately 189 different COM interfaces via MSHTML.DLL, I only need four to implement HTMLParser. For purposes here, the most important interface is IHTMLDocument2, which provides everything necessary to load and parse HTML. Once a document is loaded, an IHTMLElementCollection object is retrieved via interface functions. HTMLParser only uses get_links and get_images, but collections of other element types can be retrieved with alternative get_ methods. Each element in the document is represented by one of the 54 interfaces in Table 1, all of which are derived from IHTMLElement. HTMLParser will use IHTMLImgElement and IHTMLAnchorElement to retrieve the image and link URLs specified in the currently loaded document.

Listing One is the header file for HTMLParser. The #import statement causes the compiler to read the type library in MSHTML.DLL and create header files (MSHTML.TLH and MSHTML.TLI) that allow us to easily utilize the interfaces. Unfortunately, these header files do not contain some of the constants that we need to build HTMLParser, so you have to also include MSHTMDID.H. If you are using Visual C++ 6.0, this file is provided in the standard include directory; otherwise, you will have to install it from the Internet SDK or this article's source archive (available electronically). Another thing to note is that you must have IE4 installed for HTMLParser to work correctly. The MSHTML.DLL provided with previous versions of IE does not implement IHTMLDocument2 and will cause numerous compile time errors if used.

As you can see from Listing One, HTMLParser itself is actually a COM object, deriving from the stock interfaces IPropertyNotifySink, IOleClientSite, and IDispatch. Most of the functionality of these interfaces is unnecessary for our purposes, so the corresponding member functions just return an error code. The only COM interface methods that you need to implement in HTMLParser are the three standard IUnknown members (AddRef, Release, and QueryInterface), IPropertyNotifySink::OnChanged, and IDispatch::Invoke.

Listing Two implements HTMLParser. Like any COM object, it must implement the three standard IUnknown methods. The only interesting thing to note here is that the implementation of Release deletes the object when the reference count drops to zero. This implies that the object must have been allocated on the heap via new rather than declared as a local variable on the stack. Rather than stating this in the source file and dismissing it as caveat emptor, I made HTMLParser's constructors and destructor protected, and provided a public Create member function that returns a newly allocated object. This arrangement results in a compiler error if HTMLParser is declared as a local variable or allocated with an explicit call to new, and forces the user to always use the Create function.
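
The lifetime rule described above can be sketched without any COM headers. The class below is a hypothetical stand-in (RefCounted and LiveCount are illustrative names, not part of HTMLParser) that shows only the pattern: protected constructors and destructor, a static Create, and a Release that deletes the object when the count reaches zero.

```cpp
// Hypothetical stand-in for HTMLParser's lifetime rules; no COM involved.
class RefCounted {
public:
    // The only way to obtain an instance, so it always lives on the heap.
    static RefCounted* Create() { return new RefCounted; }

    unsigned long AddRef() { return ++m_refs; }

    // Deletes the object when the last reference is released.
    unsigned long Release() {
        if (--m_refs == 0) {
            delete this;
            return 0;
        }
        return m_refs;
    }

    // Illustration only: number of instances currently alive.
    static int LiveCount() { return s_live; }

protected:
    RefCounted() : m_refs(1) { ++s_live; }   // the caller of Create owns one reference
    RefCounted(const RefCounted&) = delete;  // no copies
    virtual ~RefCounted() { --s_live; }      // reachable only through Release

private:
    unsigned long m_refs;
    static int s_live;
};

int RefCounted::s_live = 0;
```

With this arrangement, writing `RefCounted r;` or `new RefCounted` outside the class fails to compile, and the pointer returned by Create must be balanced by exactly one Release.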

IPropertyNotifySink::OnChanged is used to track the state of our IHTMLDocument2 object. HTMLParser's constructor sets up a connection point with IHTMLDocument2, requesting to be notified if any property values change. When OnChanged is called with a property ID value of DISPID_READYSTATE, it then retrieves the property's current value via a call to IHTMLDocument2::Invoke. If the value is READYSTATE_COMPLETE, then IE4 has finished parsing the document and you notify yourself of this via the PostThreadMessage API.
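
As a rough, portable illustration of this notification flow, the sketch below replaces the COM pieces with plain C++ stand-ins. The Document struct, the property and message IDs, and the std::queue that plays the role of the thread's message queue are all assumptions made for the example, not the real MSHTML types.

```cpp
#include <queue>

enum class ReadyState { Uninitialized, Loading, Loaded, Interactive, Complete };

const int kPropReadyState  = 1;    // hypothetical property ID
const int kMsgLoadComplete = 100;  // hypothetical message ID

// Stand-in for the document whose properties are being watched.
struct Document {
    ReadyState readyState = ReadyState::Uninitialized;
};

// Stand-in for the sink: it is told which property changed, reads the
// current value back from the document, and posts a message on completion.
class ReadyStateSink {
public:
    ReadyStateSink(const Document& doc, std::queue<int>& messages)
        : m_doc(doc), m_messages(messages) {}

    void OnChanged(int propertyId) {
        if (propertyId != kPropReadyState)
            return;                              // some other property changed
        if (m_doc.readyState == ReadyState::Complete)
            m_messages.push(kMsgLoadComplete);   // tell the waiting loop to stop
    }

private:
    const Document& m_doc;
    std::queue<int>& m_messages;
};
```

Note that the notification carries only the property ID; the sink has to ask the document for the new value, just as OnChanged does through Invoke in Listing Two.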

HTMLParser implements IDispatch::Invoke for one reason -- to avoid downloading or executing Java applets, scripts, and ActiveX controls. To accomplish this, you must utilize IOleControl::OnAmbientPropertyChange with the value DISPID_AMBIENT_DLCONTROL. Before you can do this, however, you must call IOleObject::SetClientSite so that IOleControl will work properly. This is the reason that you derive from IOleClientSite, even though you don't extend its functionality in any way. After the call to OnAmbientPropertyChange, IE4 will call your Invoke method to check the value of DISPID_AMBIENT_DLCONTROL. Invoke simply sets the correct bits to disable the behavior you don't want and returns NOERROR.
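
The shape of that Invoke handler can be sketched in portable C++. Everything named here is hypothetical: the flag constants merely stand in for the DLCTL_* values (whose real definitions come from the SDK headers and are not reproduced), and the Reply enum stands in for the HRESULT codes returned in Listing Two.

```cpp
#include <cstdint>

// Hypothetical flag bits for illustration only.
const std::uint32_t kNoScripts      = 1u << 0;
const std::uint32_t kNoJava         = 1u << 1;
const std::uint32_t kNoDownloadCtls = 1u << 2;
const std::uint32_t kNoRunCtls      = 1u << 3;
const std::uint32_t kDownloadOnly   = 1u << 4;

const int kAmbientDownloadControl = 1;  // hypothetical property ID

enum class Reply { Ok, NullResult, UnknownProperty };

// Answers a host-property query the way the article's Invoke does:
// reject a missing result slot, fill in the flags for the one property
// that is supported, and report every other property as unknown.
Reply AnswerAmbientQuery(int propertyId, std::uint32_t* result) {
    if (!result)
        return Reply::NullResult;
    switch (propertyId) {
    case kAmbientDownloadControl:
        *result = kDownloadOnly | kNoScripts | kNoJava |
                  kNoDownloadCtls | kNoRunCtls;
        return Reply::Ok;
    default:
        return Reply::UnknownProperty;
    }
}
```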

HTMLParser and HTML Files

Now that I've covered HTMLParser's COM plumbing, I'll turn to the public methods used to actually load an HTML file and get information from it. IsConnected is a simple function that can be used to determine if the constructor completed successfully; it will typically fail only if IE4 is not installed. LoadHTMLFile is the workhorse routine that specifies the document to parse. After making sure the object is valid and cleaning up old member variable values, it checks to see if the requested file exists. This is necessary to keep IE4 from displaying a "file not found" dialog box, which would create problems when HTMLParser is used in noninteractive batch sessions. You then use the object's IPersistFile interface to load the file and drop into a message pump, dispatching messages until you receive the WM_USER_LOAD_COMPLETE notification posted by IPropertyNotifySink::OnChanged. The wait must be implemented as a message loop rather than with a WaitForSingleObject strategy on some kernel event, because IE4 sends window messages in the process of parsing the file and will never complete unless they are properly dispatched. After you drop out of the message loop, you fill in the member variables with the link and image collections and return.
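
To see why the loop has to dispatch rather than merely wait, consider this small simulation. It is a sketch under stated assumptions: Message, PumpUntilComplete, and kLoadComplete are hypothetical names, and a std::deque of callbacks stands in for the Windows message queue.

```cpp
#include <deque>
#include <functional>

const int kLoadComplete = 100;  // hypothetical "load finished" message ID

// A queued message: an ID plus the work to run when it is dispatched.
struct Message {
    int id;
    std::function<void()> handler;
};

// Dispatches queued messages until the completion message shows up.
// Returns false if the queue runs dry first (the analogue of WM_QUIT).
bool PumpUntilComplete(std::deque<Message>& queue, int completeId) {
    while (!queue.empty()) {
        Message msg = queue.front();   // copy, since handlers may grow the queue
        queue.pop_front();
        if (msg.id == completeId)
            return true;
        if (msg.handler)
            msg.handler();             // parsing advances only when dispatched
    }
    return false;
}
```

In the simulation, each handler represents one step of parsing and may enqueue the next step; the completion message is posted only by the final step. If the loop skipped the handlers and simply waited, that final message would never arrive, which mirrors the behavior described for IE4.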

The remaining public member functions are simply wrappers around the IHTMLElementCollection objects created in LoadHTMLFile. GetLinkCount and GetImageCount return the number of items in the collection, while GetLinkURL and GetImageURL retrieve the data for an item. GetLinkURL and GetImageURL both call the internal workhorse function GetURLFromCollection. If the requested index is out of range, IHTMLElementCollection::raw_item returns an error and the function will return False. Otherwise, you retrieve the interface corresponding to the requested element type and retrieve the associated URL. If no URL exists, get_href returns a null BSTR and the function returns False.
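
The two failure cases (index out of range, element without a URL) and the loop a caller would typically write can be sketched as follows. The Element struct and the GetUrlAt and CollectUrls functions are hypothetical stand-ins; a std::vector replaces the IHTMLElementCollection and a hasHref flag replaces the null-BSTR check.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical element: hasHref == false plays the role of a null BSTR.
struct Element {
    bool hasHref;
    std::string href;
};

// Mirrors the contract described above: returns false when the index is
// out of range or the element carries no URL, and leaves url untouched.
bool GetUrlAt(const std::vector<Element>& coll, long index, std::string& url) {
    if (index < 0 || static_cast<std::size_t>(index) >= coll.size())
        return false;
    const Element& e = coll[static_cast<std::size_t>(index)];
    if (!e.hasHref)
        return false;
    url = e.href;
    return true;
}

// Typical caller: walk every index and keep the URLs that exist.
std::vector<std::string> CollectUrls(const std::vector<Element>& coll) {
    std::vector<std::string> urls;
    std::string url;
    for (long i = 0; i < static_cast<long>(coll.size()); ++i) {
        if (GetUrlAt(coll, i, url))
            urls.push_back(url);
    }
    return urls;
}
```

A caller of the real class would follow the same pattern, looping from zero to GetLinkCount and skipping any index for which GetLinkURL returns FALSE.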

HTMLParser In Action

CheckLinks.cpp (available electronically) utilizes HTMLParser to implement a dead-link checker. It takes a file specification on the command line and iterates through all the files, checking each link and reporting its status. The method used to actually check the status of the link depends on the protocol specified in the URL passed to WinInet::CheckLink. The full source to the WinInet class is available electronically. Since WinInet::CheckLink actually retrieves the file's status over the Internet, the connection speed is the bottleneck that determines how quickly it executes.
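
The article does not list which protocols WinInet::CheckLink distinguishes, so the following is only a guess at the general shape of such a dispatch. The CheckMethod enum, the SchemeOf and MethodFor functions, and the particular set of schemes are all assumptions, and the scheme parsing is deliberately simplified (for instance, it does not special-case Windows drive letters).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

enum class CheckMethod { Http, Ftp, LocalFile, Unsupported };

// Returns the lowercased text before the first ':' or "" if there is none.
std::string SchemeOf(const std::string& url) {
    const std::string::size_type colon = url.find(':');
    if (colon == std::string::npos)
        return "";
    std::string scheme = url.substr(0, colon);
    std::transform(scheme.begin(), scheme.end(), scheme.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return scheme;
}

// Picks a checking strategy from the URL's scheme; URLs without a scheme
// are treated as relative links to local files.
CheckMethod MethodFor(const std::string& url) {
    const std::string scheme = SchemeOf(url);
    if (scheme == "http" || scheme == "https") return CheckMethod::Http;
    if (scheme == "ftp")                       return CheckMethod::Ftp;
    if (scheme == "file" || scheme.empty())    return CheckMethod::LocalFile;
    return CheckMethod::Unsupported;
}
```

Schemes that cannot be verified by fetching, such as mailto, fall through to Unsupported so that a checker can report them separately instead of flagging them as dead.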

Conclusion

I've only scratched the surface of what is possible with IE4's HTML COM interfaces. CheckLinks could be improved to check internal anchor link consistency or verify that scripts and controls are available for download. HTMLParser can easily be extended to use additional IHTMLDocument2 functionality to accomplish whatever is necessary for your specific task.

DDJ

Listing One

/* Implement an HTML parser using IE4's IHTMLDocument2 interface. */

#ifndef __HTML_H__
#define __HTML_H__

#include <windows.h>
#include <string>

// if you are using VC6 or higher, get this from the stock include
// directory; otherwise get it from the Internet SDK

#if _MSC_VER >= 1200
#pragma warning(disable:4099)   // disable spurious namespace warnings
#include <mshtmdid.h>
#else
#include "./inetsdk/include/mshtmdid.h"
#endif

#import "mshtml.dll" named_guids no_namespace
using namespace std;

#define WM_USER_LOAD_COMPLETE   (WM_USER+1)   // parenthesized so it expands safely
class HTMLParser: public IPropertyNotifySink, IOleClientSite, IDispatch
{
    public:
        static HTMLParser *Create();    // forces dynamic allocation
        STDMETHOD_(ULONG, Release)(); 
        BOOL LoadHTMLFile(LPCSTR pcszFile);
        long GetLinkCount();
        BOOL GetLinkURL(long lIndex, string &rstrURL);
        long GetImageCount();
        BOOL GetImageURL(long lIndex, string &rstrURL);
        BOOL IsConnected() const { return SUCCEEDED(m_hrConnected); }
    protected:
        // hidden constructors/destructor to force use of Create/Release
        HTMLParser(); 
        HTMLParser(const HTMLParser &); // eliminate compiler 
                                        // synthesized copy ctor
        virtual ~HTMLParser();
        // IUnknown methods
        STDMETHOD(QueryInterface)(REFIID riid, LPVOID* ppv);
        STDMETHOD_(ULONG, AddRef)();
        // IPropertyNotifySink methods
        STDMETHOD(OnChanged)(DISPID dispID);
        STDMETHOD(OnRequestEdit)(DISPID dispID) { return NOERROR; }
        // IOleClientSite methods
        STDMETHOD(SaveObject)(void) 
            { return E_NOTIMPL; }
        STDMETHOD(GetMoniker)(DWORD dwAssign,
                                   DWORD dwWhichMoniker, IMoniker** ppmk)
            { return E_NOTIMPL; }
        STDMETHOD(GetContainer)(IOleContainer** ppContainer)
            { return E_NOTIMPL; }
        STDMETHOD(ShowObject)(void)
            { return E_NOTIMPL; }
        STDMETHOD(OnShowWindow)(BOOL fShow)
            { return E_NOTIMPL; }
        STDMETHOD(RequestNewObjectLayout)(void)
            { return E_NOTIMPL; }
        // IDispatch methods
        STDMETHOD(GetTypeInfoCount)(UINT* pctinfo)
            { return E_NOTIMPL; }
        STDMETHOD(GetTypeInfo)(UINT iTInfo, LCID lcid, ITypeInfo** ppTInfo)
            { return E_NOTIMPL; }
        STDMETHOD(GetIDsOfNames)(REFIID riid, LPOLESTR* rgszNames,
                                UINT cNames, LCID lcid, DISPID* rgDispId)
            { return E_NOTIMPL; }
        STDMETHOD(Invoke)(DISPID dispIdMember, REFIID riid, LCID lcid,
            WORD wFlags, DISPPARAMS __RPC_FAR *pDispParams,
            VARIANT __RPC_FAR *pVarResult, EXCEPINFO __RPC_FAR *pExcepInfo,
            UINT __RPC_FAR *puArgErr);
        // helper functions
        BOOL GetURLFromCollection(IHTMLElementCollection *pCollection, 
                                  REFIID rIID, long lIndex, string &rstrURL);
        // member variables
        DWORD   m_dwRef;
        HRESULT  m_hrConnected;
        DWORD    m_dwCookie;
        IHTMLDocument2* m_pMSHTML;
        LPCONNECTIONPOINT m_pCP;
        IHTMLElementCollection *m_pAnchorLinks;
        IHTMLElementCollection *m_pImageLinks;
};
#endif

Listing Two

/* Implement an HTML parser using IE4's IHTMLDocument2 interface. */
#include <windows.h>
#include <comdef.h>
#include <io.h>
#include "html.h"
#include <iostream>
using namespace std;
/* static function used to force dynamic allocation */
HTMLParser *HTMLParser::Create()
{
    return new HTMLParser;
}
// constructor/destructor
HTMLParser::HTMLParser()
{
    HRESULT hr;
    LPCONNECTIONPOINTCONTAINER pCPC = NULL;
    LPOLEOBJECT pOleObject = NULL;
    LPOLECONTROL pOleControl = NULL;
    // initialize all the class member variables
    m_dwRef = 1;    // must start at 1 for the current instance
    m_hrConnected = E_FAIL; // S_FALSE would pass SUCCEEDED(); stay failed until Advise
    m_dwCookie = 0;
    m_pMSHTML = NULL;
    m_pCP = NULL;
    m_pAnchorLinks = NULL;
    m_pImageLinks = NULL;
    // Create an instance of a dynamic HTML document
    if (FAILED(hr = CoCreateInstance( CLSID_HTMLDocument, NULL, 
           CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, (LPVOID*)&m_pMSHTML )))
    {
        goto Error;
    }
    if (FAILED(hr = m_pMSHTML->QueryInterface(IID_IOleObject, 
                                                   (LPVOID*)&pOleObject)))
    {
        goto Error;
    }
    hr = pOleObject->SetClientSite((IOleClientSite*)this);
    pOleObject->Release();
    if (FAILED(hr = m_pMSHTML->QueryInterface(IID_IOleControl, 
                                                   (LPVOID*)&pOleControl)))
    {
        goto Error;
    }
    hr = pOleControl->OnAmbientPropertyChange(DISPID_AMBIENT_DLCONTROL);
    pOleControl->Release();
    // Hook up sink to catch ready state property change
    if (FAILED(hr = m_pMSHTML->QueryInterface(IID_IConnectionPointContainer, 
                                                            (LPVOID*)&pCPC)))
    {
        goto Error;
    }
    if (FAILED(hr = pCPC->FindConnectionPoint(IID_IPropertyNotifySink, 
                                                                  &m_pCP)))
    {
        goto Error;
    }
    m_hrConnected = m_pCP->Advise((LPUNKNOWN)(IPropertyNotifySink*)this, 
                                                                &m_dwCookie);
Error:
    if (pCPC) pCPC->Release();
}
HTMLParser::~HTMLParser()
{
    if ( m_pAnchorLinks )
        m_pAnchorLinks->Release();
    if ( m_pImageLinks )
        m_pImageLinks->Release();
    if (SUCCEEDED(m_hrConnected))
        m_pCP->Unadvise(m_dwCookie);
    if (m_pCP) 
        m_pCP->Release();
    if ( m_pMSHTML )
        m_pMSHTML->Release();
}
STDMETHODIMP HTMLParser::QueryInterface(REFIID riid, LPVOID* ppv)
{
    *ppv = NULL;
    if (IID_IUnknown == riid || IID_IPropertyNotifySink == riid)
    {
        *ppv = (LPUNKNOWN)(IPropertyNotifySink*)this;
        AddRef();
        return NOERROR;
    }
    else if (IID_IOleClientSite == riid)
    {
        *ppv = (IOleClientSite*)this;
        AddRef();
        return NOERROR;
    }
    else if (IID_IDispatch == riid)
    {
        *ppv = (IDispatch*)this;
        AddRef();
        return NOERROR;
    }
    else
        return E_NOINTERFACE;   // standard result for an unsupported interface
}
STDMETHODIMP_(ULONG) HTMLParser::AddRef()
{
    return ++m_dwRef;
}
STDMETHODIMP_(ULONG) HTMLParser::Release()
{
    if (--m_dwRef == 0) 
    { 
        delete this; 
        return 0; 
    }
    return m_dwRef;
}
STDMETHODIMP HTMLParser::OnChanged(DISPID dispID)
{
    HRESULT hr;
    if (DISPID_READYSTATE == dispID)
    {
        VARIANT vResult = {0};
        EXCEPINFO excepInfo;
        UINT uArgErr;
        long lReadyState;
        DISPPARAMS dp = {NULL, NULL, 0, 0};
        if (SUCCEEDED(hr = m_pMSHTML->Invoke(DISPID_READYSTATE, IID_NULL, 
                          LOCALE_SYSTEM_DEFAULT, DISPATCH_PROPERTYGET, 
                          &dp, &vResult, &excepInfo, &uArgErr)))
        {
            lReadyState = (READYSTATE)V_I4(&vResult);
            switch (lReadyState)
            {   
            case READYSTATE_UNINITIALIZED:
            case READYSTATE_LOADING: 
            case READYSTATE_LOADED: 
            case READYSTATE_INTERACTIVE:
                break;
            case READYSTATE_COMPLETE: 
                // IE4 is finished parsing the file
                BOOL fRet = PostThreadMessage(GetCurrentThreadId(),
                                WM_USER_LOAD_COMPLETE, (WPARAM)0, (LPARAM)0);
                break;
            }
            VariantClear(&vResult);
        }
    }
    return NOERROR;
}
STDMETHODIMP HTMLParser::Invoke(DISPID dispIdMember, REFIID riid, LCID lcid,
            WORD wFlags, DISPPARAMS __RPC_FAR *pDispParams,
            VARIANT __RPC_FAR *pVarResult, EXCEPINFO __RPC_FAR *pExcepInfo,
            UINT __RPC_FAR *puArgErr)
{
    if (!pVarResult)
    {
        return E_POINTER;
    }
    switch(dispIdMember)
    {
    case DISPID_AMBIENT_DLCONTROL:
        // This tells IE4 that you want to download the page, but you don't 
        // want to run scripts, Java applets, or ActiveX controls
        V_VT(pVarResult) = VT_I4;
        V_I4(pVarResult) =  DLCTL_DOWNLOADONLY | 
                            DLCTL_NO_SCRIPTS | 
                            DLCTL_NO_JAVA |
                            DLCTL_NO_DLACTIVEXCTLS |
                            DLCTL_NO_RUNACTIVEXCTLS;
        break;
    default:
        return DISP_E_MEMBERNOTFOUND;
    }
    return NOERROR;
}
BOOL HTMLParser::LoadHTMLFile(LPCSTR pcszFile)
{
    HRESULT        hr;
    LPPERSISTFILE  pPF;
    IHTMLElementCollection* pColl = NULL;
    MSG msg;
    if ( !IsConnected() )
        return FALSE;
    // kill any previous links
    if ( m_pAnchorLinks )
    {
        m_pAnchorLinks->Release();
        m_pAnchorLinks = NULL;
    }
    if ( m_pImageLinks )
    {
        m_pImageLinks->Release();
        m_pImageLinks = NULL;
    }
    // avoid IE error msg box if the file does not exist
    if ( access(pcszFile, 0x00) != 0x00 )
    {
        return FALSE;
    }
    _bstr_t bstrFile(pcszFile);
    // use IPersistFile to load the HTML
    if ( SUCCEEDED(hr = m_pMSHTML->QueryInterface(IID_IPersistFile, 
                                                         (LPVOID*) &pPF)))
    {
        hr = pPF->Load((LPCWSTR)bstrFile, 0);
        pPF->Release();
    }
    BOOL bOK = FALSE;
    if (SUCCEEDED(hr))
    {
        while (GetMessage(&msg, NULL, 0, 0) > 0)  // stop on WM_QUIT (0) or error (-1)
        {
            // notification from OnChanged
            if (WM_USER_LOAD_COMPLETE == msg.message && NULL == msg.hwnd)
            {
                bOK = TRUE;
                break;
            }
            else
            {
                DispatchMessage(&msg);
            }
        }
    }
    if ( bOK )
    {
        try
        {
            if ( FAILED(m_pMSHTML->get_links(&m_pAnchorLinks)) ||
                 FAILED(m_pMSHTML->get_images(&m_pImageLinks)) ) 
            {
                throw exception();
            }
        } 
        catch ( const exception & )  // catch by reference; the object is unused
        {
            if ( m_pAnchorLinks )
            {
                m_pAnchorLinks->Release();
                m_pAnchorLinks = NULL;
            }
            if ( m_pImageLinks )
            {
                m_pImageLinks->Release();
                m_pImageLinks = NULL;
            }
            bOK = FALSE;
        }
    }
    return bOK;
}
/* Get the number of links present in the current HTML file */
long HTMLParser::GetLinkCount()
{
    long lCount = 0;
    if ( m_pAnchorLinks )
        m_pAnchorLinks->get_length(&lCount);
    return lCount;
}
/* Get the number of images present in the current HTML file */
long HTMLParser::GetImageCount()
{
    long lCount = 0;
    if ( m_pImageLinks )
        m_pImageLinks->get_length(&lCount);
    return lCount;
}
/* Get the URL associated with a given link */
BOOL HTMLParser::GetLinkURL(long lIndex, string &rstrURL)
{
    if ( IsConnected() && m_pAnchorLinks )
        return GetURLFromCollection(m_pAnchorLinks, 
                                IID_IHTMLAnchorElement, lIndex, rstrURL);
    else
        return FALSE;
}
/* Get the URL associated with a given image */
BOOL HTMLParser::GetImageURL(long lIndex, string &rstrURL)
{
    if ( IsConnected() && m_pImageLinks )
        return GetURLFromCollection(m_pImageLinks, IID_IHTMLImgElement, 
                                                         lIndex, rstrURL);
    else
        return FALSE;
}
/* Get URL associated with an element in a collection. Element must be an 
image or an anchor. */
BOOL HTMLParser::GetURLFromCollection(IHTMLElementCollection *pCollection, 
                                  REFIID rIID, long lIndex, string &rstrURL)
{
    VARIANT     varIndex;
    VARIANT     var2;
    HRESULT     hr;
    IDispatch*  pDisp = NULL; 
    BOOL        bFound = FALSE;

    VariantInit( &varIndex );
    varIndex.vt = VT_I4;        // lVal is the VT_I4 member of the union
    varIndex.lVal = lIndex;
    VariantInit( &var2 );
    hr = pCollection->raw_item( varIndex, var2, &pDisp );

    if ( SUCCEEDED(hr) && pDisp)
    {
        IHTMLImgElement* pImgElem = NULL;
        IHTMLAnchorElement* pAnchorElem = NULL;
        BSTR bstr = NULL;
        if ( rIID == IID_IHTMLImgElement &&             
             SUCCEEDED(pDisp->QueryInterface(rIID, (void **)&pImgElem)) )
        {
            pImgElem->get_href(&bstr);
            pImgElem->Release();
            bFound = (bstr != NULL);
        }
        else if ( rIID == IID_IHTMLAnchorElement &&             
                  SUCCEEDED(pDisp->QueryInterface(rIID, 
                                              (void **)&pAnchorElem)) )
        {
            pAnchorElem->get_href(&bstr);
            pAnchorElem->Release();
            bFound = (bstr != NULL);
        }

        pDisp->Release();
        if ( bFound && bstr )
        {
            // _bstr_t wrapper will delete since fCopy is FALSE
            _bstr_t bstrHREF(bstr, FALSE);
            rstrURL = (LPCSTR)bstrHREF; 
        }
    }
    return bFound;
}
