Net Gain
Al Williams
Al is a consultant and author based in the Houston area. Look for his latest book about ActiveX control (Coriolis Group Books). You can find Al on the Web at http://ourworld.compuserve.com/homepages/Al_Williams.
Do you remember the big CB radio craze in the mid '70s? For no apparent reason (other than, perhaps, Radio Shack's massive ad campaigns), everyone seemed to discover CB radio at the same time. Before the craze hit, you could listen to tree-cutting companies, construction crews, and (of course) truckers on CB's 23 channels. Suddenly, neighborhoods and cars sprouted CB antennas at a phenomenal rate. So many people bought CB radios, in fact, that the airwaves were swamped.
The FCC reacted by expanding the band to 40 channels and letting people create their own call signs. CB paraphernalia--magazines, caps, and the like--started showing up everywhere. CB clubs were formed, and there were even country and western CB songs.
But after a few years (again for no apparent reason), the craze died down. The people who originally used it either stayed or moved to UHF CB bands. Interestingly, most of the people who bought CBs didn't really know if they needed one. They only knew that if everybody else had them, then they should too.
Does this pattern sound familiar? What else has been around for years, was suddenly discovered by marketeers and the public, and then grew explosively? The Internet, of course. Only time will tell if this is a fad that will die out or a major change in our culture. While I'm guessing some of the hype will die down, there's no question that the Internet is more useful than CB and more people will remain when other new fads crop up. (By the way, I've yet to hear a country and western song about losing a girlfriend on the Internet.)
For the time being, the Internet is hot. If you want to be the life of a party, mention the Net. All sorts of people now have web pages (as if you cared what kind of pizza they like), companies have to have a Web presence (although most don't know why), and you register or update software via the Web.
In times past, adding Net-related functionality to your software required you to know about exotic TCP/IP stacks, domain name services, and bizarre protocols. With a new set of tools from Microsoft, you can sidestep all this complexity and just drop Internet functions into your Visual Basic, Delphi, or C++ code.
The key to this is Microsoft's ActiveX Internet Control Pack (ICP), which provides ActiveX controls (formerly known as OLE controls) for many common Internet tasks (see Table 1). In this column, I'll show you a Visual Basic program that will
- Scan a list of URLs to detect pages that have changed since the last time the program ran.
- Allow you to view the pages with a built-in web browser.
- Let you maintain the list of URLs that it checks.
Without the ICP, all of this would be a daunting task. With the ICP, it boils down to some simple Visual Basic programming. At this writing, the ICP is a beta product, but you can download it for free from the Microsoft web site (http://www.microsoft.com).
An ActiveX Primer
ActiveX is essentially OCX with a new name. Okay, it's a bit more than that, but for most VB programs that definition is good enough. An ActiveX control is a drop-in module that you can add to Visual Basic's component palette. You use the Tool menu's Custom Control selection to add the control. This makes the component appear on the palette so that you can drop it on any form.
Once you have a component on a form, you manipulate it like any other VB object. You'll find properties and events that are similar to any other VB component. You can also use the Object Browser (available on the View menu) to scan the component's elements (see Figure 1).
In addition to the custom controls, you need to add a reference to the support objects that the controls use. You can do this by selecting the References item from the Tools menu. Make sure the "Microsoft Internet Support Objects" selection has a check mark next to it. If you fail to do this, the components will not be able to find other objects that they use.
The Design
To demonstrate the ICP, I wrote a program that scans a list of URLs and tells me when they have changed. Since the ICP has a built-in web browser control, I decided I'd also allow the program to display any of the URLs.
The program has four forms and three major sections. The most important section scans the list of URLs, retrieves them from the Internet, and computes a checksum. If the checksum is different from the last computed checksum, the form places the URL in a listbox. The file that holds the URLs also holds their last computed checksums. This is a perfect place to use the HTTP control.
If you place a filename on the command line, the scanner will use that file; otherwise, the program prompts for a file. If you specify a file that doesn't exist, you will create an empty list. You can then add entries to the file via the user interface.
The second part of the program is a simple web browser that shows the page and nothing more. You can click on a link to open a linked page, but you can't move forward and backward as in a conventional browser.
The third part is a simple form to manage the list of URLs. From this form, you can select a URL and access the simple browser. This allows you to look at any page the scanner knows about, without launching a full-featured browser. Other than this link to the browser, this code doesn't use ICP at all. When you make changes to the list, the form prompts to see if it should save the changes.
I included a fourth form to provide a tip form that offers some simple usage help. Of course, a help file would be better, but for a project this small, it is often easier to just write a small form with some text. While this isn't as nice as a full help file, it is better than nothing (which is the more likely alternative).
The file format is simple. The first data item is a count of URLs in the list. This appears as ASCII text on a line by itself. The next line contains a URL. Finally, there is a line with the last known checksum (or -1 if the URL has no previous checksum). The file continues with URL/checksum pairs. This ASCII format isn't very efficient, but it is easy for debugging and, in practice, the URL lists aren't very large anyway.
You might question the relative efficiency of computing a checksum on the entire page to decide if it changed or not. While this isn't ideal, there doesn't seem to be a consistently better way to determine when a page changes. The mime header (the same header you see on many mail
messages) has provisions for a date-modified field. Ideally, you'd just check this. However, not all servers set this field. It wouldn't be a bad idea to check for the date first and then compute the checksum only if the date is not available. Of course, any page that regularly changes to display advertisements, counters, or other material on each hit will always appear to change regardless of the method used.
Using the HTTP Control
You can use the HTTP control to send or receive documents using the HTTP protocol. This means you can read ordinary web pages (or send them, if you prefer). This control doesn't display the pages, nor does it open arbitrary URLs. You can't use an HTTP control to work with ftp://ftp.mv.com, for example. You can use it to read the http://www.ddj.com page, however.
Table 2 is a summary of the HTTP control's properties, methods, and events. Don't worry. You won't need to use most of these in most of your programs. To read an HTTP document, you'll need to deal with the control's DocOutput property (see Table 3). This is a special object that represents the incoming data. You can examine the header data or the actual document data. To initiate a transfer, call the GetDoc method.
Data arrives at the control in chunks. Each time the control has more data, it generates a DocOutput event. This is your cue to read the data and put it away. The DocOutput.State property tells you why the control generated the event (see Table 4). You can access header data using the DocOutput's Headers property. You can use this collection to examine all the headers (using the Count and Text properties). You can also index into the collection and retrieve a particular value by its name; for example, ctype=DocOutput.DocHeaders.Item("content-type").Value will assign the value of the content-type field to the ctype string variable. For example, the ctype variable might take the value "text/plain" for a typical web page.
If you prefer, you can specify a file name as an argument to GetDoc. Then the control will just store the data in the file, and you can then ignore the DocOutput event.
Another property you may use is the Errors property, which provides an indication of any error that occurs. The Errors property is of type icError. You can access the Code, Type, or Description property to get more details about a particular error. The HTTP control will generate an Error event on some errors. Other errors show up as icDocError codes in the DocOutput.State property during the DocOutput event.
To summarize, these are the basic steps for using the HTTP control:
1. Call the object's GetDoc method.
2. Handle the DocObject event.
3. In the event handler, check the DocOutput.State and read data as required using DocOutput.Headers and DocOutput.GetData.
4. Handle errors when DocOutput.State is equal to icDocError or when the control raises an Error event.
Using the HTML Control
If you want to display an HTML page, there are two ways you can use the HTML control. The first way is very simple; just pass the URL you want to show as an argument to the RequestDoc method. The second way, which gives you more control, is to handle DocOutput events just like the HTTP control. You might do this if, for example, you want to display an animation that runs while the page loads.
There are many other things you can do with the HTML control (see the ICP help). However, for most simple applications, you don't need to do anything other than use RequestDoc. It is probably a good idea to handle the Error event, too.
Implementation
My first attempt to write this program was very simple. During the FormLoad event, the code read the URL list from a file. Then it called the ScanURL subroutine, which initiated a transfer (using GetDoc). The DocOutput event did all the work. Then, when the transfer completed, the DocOutput event handler initiated a new GetDoc request.
This sounds good, but it didn't work well at all. The problem is that while the handler is running (even if the state is icDocEnd) the HTTP control can't accept another GetDoc command. A similar problem occurs when you try to use the Busy event to detect changes in the busy state. The control doesn't change the busy flag when it encounters an error while connecting. However, you are not allowed to start a new connection from within the Error event handler.
I tried a variety of ways to fix this problem. I finally decided to change the ScanURL routine to start a timer. When the timer event fires, the code checks to see if the HTTP control is busy. If it isn't, the timer code starts a new transfer (until there are no more URLs to check). All of the work still occurs in the DocOutput event handler. Each chunk of data that arrives adds to a checksum. The checksum is a LongInt, and before it overflows (which would require more than 16 MB of data), the code wraps the overflow around.
You can find the main scan code in Listing One. Figure 2 is the resulting program. Notice that most of the code reads the URL list. The code that reads data from the network only appears in the timer handler and in the DocOutput handler.
If you elect to view a URL, the main form places the URL you select into the global URL variable, and then starts the HTML form (see Listing Two). This form is unbelievably simple since the HTML control does all the real work. The only code in the form sets the title, communicates the URL to the control, and
resizes the control when the form's size changes.
The other modules in the program manage the URL list and show some tips. You can find the files for these forms online. When you are maintaining the list, you can view any URL by clicking a button. The code for this button uses the HTML form again to show the specified page.
Before you can use the HTTP and HTML controls in a program, you have to install them on Visual Basic's component palette using the Custom Control comment on the Tools menu.
Problems with the ICP
Although there were some synchronization issues when using the HTTP control, it still seems much simpler than trying to write the code without a control. In particular, you don't need to know anything about the protocol or how to start the network. The HTML control certainly is simple to use.
I found several rough spots in these controls. To be fair, the controls I used are from the second beta test. These problems may or may not occur in the final release:
- If you are using dial-up networking (and it is properly configured), the program will start your default connection the first time you run it. If you cancel the connection, you get connection errorsthat's not unexpected. However, each time you restart the program it will immediately give you the same connection errors until you reboot the system!
- There doesn't appear to be any simple way to make the HTML control scroll with the page-up and page-down keys.
- The HTTP control won't accept URLs like file://.... It only takes HTTP URLs.
- The HTML control doesn't properly read some embedded graphics, nor will it play audio files, and so on.
I discovered another issue, though this one isn't a problem with the ICP. I set the common dialog to use the URLS extension. This is legal under Windows NT (with NTFS) and Windows 95. However, the common dialog control doesn't always understand that the default extension might be more than three characters! The dialog functions properly unless you specify a file name that doesn't exist and doesn't have an extension. In such cases, the dialog just uses the first three characters of the extension. I didn't attempt to correct this behavior.
Don't forget, if a page changes to display a counter or an advertisement each time you access it, the scanner will always show a change. Also, remember that a large page may take a long time to load.
The End of the Internet?
With the ICP, you don't have much excuse not to get your programs wired to the Net. Although this example uses Visual Basic, there is no reason you can't use the ICP with Delphi, C++, or any other language that supports OCX controls.
Only time will tell if the Internet will survive its own success. And who knows? You might just write that Internet country and western song.
Table 1: ICP components.
Component Purpose FTP Transfer files using FTP. HTML Display HTML documents (including Web pages). HTTP Retrieve documents from the network using the HTTP protocol. NNTP Retrieve and post news articles. POP3 Receive mail using a POP server. SMTP Send mail using SMTP. TCP Creates or connects to a reliable socket to exchange data. UDP Creates or connects to an unreliable (datagram) socket to exchange data.
Table 2: The HTTP control's members: (a) properties; (b) methods; (c) events.
Method Description (a) Busy True if control is busy. DocOutput DocOutput object that represents the incoming document (see Table 3 for more about this object). DocInput DocInput object that represents an outgoing document. Document In conjunction with the RemoteHost property, this property sets the URL; usually, you'll use the URL property instead. EnableTimer Lets you enable or disable specific timeouts. Errors Provides specific error information; the Description field provides a text description of the error Method Sets the action the control will take; you can request only the headers, for example. NotificationMode Lets you request continuous notification, or defer notifications until a transaction completes. ProtocolState Shows the state (for example, connected) for the control. RemoteHost Use this property in combination with the Document property to specify a document (also see URL). RemotePort Remote port to connect to. ReplyCode Returns the response code for the last operation. ReplyString Returns the text associated with the response code for the last operation. State The current state of the control (see Table 4). TimeOut Sets the actual timeout value. URL URL for the current document. (b) Cancel Cancels current operation. GetDoc Request a document; you may specify the URL and process data in the DocOutput event or ask the control to store the data in a file. PerformRequest Perform operation specified by Method property (usually you use GetDoc or SendDoc instead). SendDoc Sends a document; you may specify the URL. (c) Busy Fires when the control's Busy property changes. Cancel Fires when the current operation cancels. DocInput Fires when you must supply data for an outbound document. DocOutput Fires when you may retrieve data for an inbound document. Error Fires when an error occurs. ProtocolStateChanged Fires when the protocol state changes. StateChanged Fires when state changes. TimeOut Fires when a timeout occurs
Table 3: The DocOutput object: (a) properties; (b) methods.
Property/Method Description (a) BytesTotal Number of bytes to transfer or zero if unknown (usually the case). BytesTransferred Number of bytes transferred. DocLink Used to specify that this output should link to a DocInput's input data. Filename Filename for document. Headers Provides access to the HTTP headers. PushStreamMode Not used for DocOutput objects. State Reflects state of transfer (see Table 5). Suspended True when the transfer is suspended. (b) GetData Get input data. SetData Set output data. Suspend Suspend transfer.
Table 4: Possible values for the HTTP state property.
Constant Value Description prcConnecting 1 Connecting, waiting for connect acknowledge. prcResolvingHost 2 Resolving remote host from symbolic name to IP address. prcHostResolved 3 Host name resolution complete. prcConnected 4 Connection established. prcDisconnecting 5 Initiated connection close or disconnect. prcDisconnected 6 Initial state (also state after unsuccessful connects or a disconnect).
Table 5: Possible values for the DocOutput state property.
Name Value Description icDocNone 0 No transfer is in progress. icDocBegin 1 Transfer is being initiated. icDocHeaders 2 Document headers are transferred. icDocData 3 One block of data is transferred. icDocError 4 An error has occurred during transfer. icDocEnd 5 Transfer is complete (either successfully or with an error).
Figure 1: Visual Basic Object Browser.
Figure 2: URLScan in action.
Listing One
VERSION 4.00
Begin VB.Form Scan
BorderStyle = 1 'Fixed Single
Caption = "URL Scan"
ClientHeight = 4470
ClientLeft = 1185
ClientTop = 1650
ClientWidth = 8385
Height = 4905
Left = 1125
LinkTopic = "Form1"
MaxButton = 0 'False
MinButton = 0 'False
ScaleHeight = 4470
ScaleWidth = 8385
Top = 1275
Width = 8505
Begin VB.CommandButton Tips
Caption = "&Tips"
Height = 375
Left = 4800
TabIndex = 7
Top = 3600
Width = 975
End
Begin VB.Timer Timer1
Enabled = 0 'False
Interval = 1000
Left = 7080
Top = 3840
End
Begin VB.CommandButton Config
Caption = "&Configure"
Height = 375
Left = 3480
TabIndex = 4
Top = 3600
Width = 975
End
Begin VB.CommandButton View Caption = "&View"
Enabled = 0 'False
Height = 375
Left = 2160
TabIndex = 3
Top = 3600
Width = 975
End
Begin VB.CommandButton ReScan
Caption = "&Rescan"
Height = 375
Left = 840
TabIndex = 2
Top = 3600
Width = 975
End
Begin VB.ListBox List1
Height = 2595
Left = 720
TabIndex = 0
Top = 480
Width = 7335
End
Begin MSComDlg.CommonDialog OpenDialog
Left = 7680
Top = 3720
_Version = 65536
_ExtentX = 847
_ExtentY = 847
_StockProps = 0
CancelError = -1 'True
DefaultExt = ".urls"
DialogTitle = "Select a list of URLs"
Filter = "URL files (*.urls)|*.urls"
Flags = 8196
End
Begin ComctlLib.ProgressBar Progress
Height = 255
Left = 6360
TabIndex = 6
Top = 3240
Visible = 0 'False
Width = 1695
_Version = 65536
_ExtentX = 2990
_ExtentY = 450
_StockProps = 192
Appearance = 1
End
Begin VB.Label Status
Height = 255
Left = 840
TabIndex = 5
Top = 3240
Width = 5415 End
Begin HTTPCTLib.HTTP HTTP1
Left = 6480
Top = 3840
_ExtentX = 635
_ExtentY = 635
RemoteHost = "127.0.0.1"
RemotePort = 80
ConnectTimeout = 0
RecvTimeout = 0
NotificationMode= 1
Document = ""
Method = 1
End
Begin VB.Label Label1
Caption = "Changed URLs:"
Height = 255
Left = 720
TabIndex = 1
Top = 120
Width = 1215
End
End
Attribute VB_Name = "Scan"
Attribute VB_Creatable = False
Attribute VB_Exposed = False
Sub ScanURLs()
' Start scan process via timer
List1.Clear
View.Enabled = False
nextct = 0
If ct <> 0 Then
ReScan.Enabled = False
Timer1.Enabled = True
Progress.Visible = True
End If
End Sub
Private Sub Config_Click()
ConfigForm.Show vbModal
End Sub
Private Sub Form_Load()
ct = 0
If Command <> "" Then ' read command line
filename = Command
Else
' prompt for file
On Error Resume Next
OpenDialog.ShowOpen
If Err.Number <> 0 Then End
On Error GoTo 0
filename = OpenDialog.filename
End If Scan.Caption = Scan.Caption + " (" + filename + ")"
On Error Resume Next
Open filename For Input As #1
If Err.Number <> 0 Then Exit Sub ' new file!
' Read count
Line Input #1, T$
ct = Val(T$)
' Resize everything
ReDim URLs(ct)
ReDim CkSum(ct)
' Read file
For i = 0 To ct - 1
Line Input #1, URLs(i)
Line Input #1, T$
CkSum(i) = Val(T$)
Next i
Close 1
ScanURLs
End Sub
Private Sub HTTP1_DocOutput(ByVal DocOutput As DocOutput)
Static errflag As Boolean
Static csum As Long
Static progNoGood As Boolean
Dim v As Long
' Set up on icDocBegin
If DocOutput.State = icDocBegin Then
errflag = False
csum = 0
tmp = DocOutput.BytesTotal
If tmp <> 0 Then
Progress.Max = tmp ' we know how big it is!
progNoGood = False
Else
' usually the control has no idea how big the file is
' so fake it
Progress.Max = 4096
progNoGood = True
End If
Progress.Value = 0
Exit Sub
End If
If DocOutput.State = icDocError Then errflag = True
' Calculate checksum
If Not errflag And DocOutput.State = icDocData Then
DocOutput.GetData Data, vbArray + vbByte
If progNoGood Then
If Progress.Value + UBound(Data) > 4096 Then Progress.Value = 0
End If
If UBound(Data) < Progress.Max Then
Progress.Value = Progress.Value + UBound(Data)
End If
For i = 0 To UBound(Data)
csum = csum + Data(i)
' wrap around overflow If (csum < 0) Then csum = (csum And &H7FFFFFF) + 1
Next i
End If
If DocOutput.State <> icDocEnd Then Exit Sub
' Only do this part at the end
If csum <> CkSum(nextct - 1) And Not errflag Then
'Add to list
List1.AddItem URLs(nextct - 1)
CkSum(nextct - 1) = csum
End If
BusyFlag = False ' tell timer to go ahead
End Sub
Private Sub HTTP1_Error(Number As Integer, Description As String, _
Scode As Long, Source As String, HelpFile As String, _
HelpContext As Long, CancelDisplay As Boolean)
' show error text
MsgBox Description
Status.Caption = "Last error: " + Description
BusyFlag = False
End Sub
Private Sub List1_Click()
If List1.ListIndex = -1 Then
View.Enabled = False
Else
View.Enabled = True
End If
End Sub
Private Sub ReScan_Click()
ScanURLs
End Sub
Private Sub Timer1_Timer()
If nextct >= ct Then ' if done
If BusyFlag Then Exit Sub ' wait for it!
' Turn off everything
Timer1.Enabled = False
Progress.Visible = False
' And write out new file
SaveFile
ReScan.Enabled = True
Status.Caption = "Operation Complete"
Else
If Not BusyFlag Then
' Go to next URL
BusyFlag = True
nextct = nextct + 1
Status.Caption = "Scanning " + URLs(nextct - 1)
On Error Resume Next
HTTP1.GetDoc URLs(nextct - 1)
If Err.Number <> 0 Then
If Err.Number = 1102 Then MsgBox "Invalid URL: " + URLs(nextct - 1)
Status.Caption = "Last error: Invalid URL"
BusyFlag = False
Exit Sub
Else
MsgBox "Unknown error occured"
End If
Resume
End If
End If
End If
End Sub
Private Sub Tips_Click()
TipsForm.Show
End Sub
Private Sub View_Click()
URL = List1.Text
HTMLForm.Show vbModal
End Sub
Listing Two
VERSION 4.00
Begin VB.Form HTMLForm
Caption = "HTML"
ClientHeight = 5355
ClientLeft = 585
ClientTop = 945
ClientWidth = 6960
Height = 5790
Left = 525
LinkTopic = "Form1"
ScaleHeight = 357
ScaleMode = 3 'Pixel
ScaleWidth = 464
Top = 570
Visible = 0 'False
Width = 7080
Begin HTMLObjects.HTML HTML1
Height = 5415
Left = 0
TabIndex = 0
Top = 0
Width = 6975
_ExtentX = 12303
_ExtentY = 9551
DeferRetrieval = 0 'False
ElemNotification= 0 'False
UseDocColors = -1 'True
UnderlineLinks = -1 'True
Redraw = -1 'True
ViewSource = 0 'False
RetainSource = -1 'True Timeout = 30
BackImage = ""
BackColor = 12632256
ForeColor = 0
LinkColor = 16711680
VisitedColor = 16711935
BeginProperty RegularFont {0BE35203-8F91-11CE-9DE3-00AA004BB851}
name = "MS Sans Serif"
charset = 0
weight = 400
size = 8.25
underline = 0 'False
italic = 0 'False
strikethrough = 0 'False
EndProperty
BeginProperty FixedFont {0BE35203-8F91-11CE-9DE3-00AA004BB851}
name = "MS Sans Serif"
charset = 0
weight = 400
size = 8.25
underline = 0 'False
italic = 0 'False
strikethrough = 0 'False
EndProperty
BeginProperty Heading1Font {0BE35203-8F91-11CE-9DE3-00AA004BB851}
name = "MS Sans Serif"
charset = 0
weight = 400
size = 8.25
underline = 0 'False
italic = 0 'False
strikethrough = 0 'False
EndProperty
BeginProperty Heading2Font {0BE35203-8F91-11CE-9DE3-00AA004BB851}
name = "MS Sans Serif"
charset = 0
weight = 400
size = 8.25
underline = 0 'False
italic = 0 'False
strikethrough = 0 'False
EndProperty
BeginProperty Heading3Font {0BE35203-8F91-11CE-9DE3-00AA004BB851}
name = "MS Sans Serif"
charset = 0
weight = 400
size = 8.25
underline = 0 'False
italic = 0 'False
strikethrough = 0 'False
EndProperty
BeginProperty Heading4Font {0BE35203-8F91-11CE-9DE3-00AA004BB851}
name = "MS Sans Serif"
charset = 0
weight = 400 size = 8.25
underline = 0 'False
italic = 0 'False
strikethrough = 0 'False
EndProperty
BeginProperty Heading5Font {0BE35203-8F91-11CE-9DE3-00AA004BB851}
name = "MS Sans Serif"
charset = 0
weight = 400
size = 8.25
underline = 0 'False
italic = 0 'False
strikethrough = 0 'False
EndProperty
BeginProperty Heading6Font {0BE35203-8F91-11CE-9DE3-00AA004BB851}
name = "MS Sans Serif"
charset = 0
weight = 400
size = 8.25
underline = 0 'False
italic = 0 'False
strikethrough = 0 'False
EndProperty
End
End
Attribute VB_Name = "HTMLForm"
Attribute VB_Creatable = False
Attribute VB_Exposed = False
Private Sub Form_Load()
HTMLForm.Caption = URL
HTML1.RequestDoc URL
End Sub
Private Sub Form_Resize()
' If form resizes, resize control too!
HTML1.Width = HTMLForm.ScaleWidth
HTML1.Height = HTMLForm.ScaleHeight
End Sub
Private Sub HTML1_Error(Number As Integer, Description As String, _
Scode As Long, Source As String, HelpFile As String, _
HelpContext As Long, CancelDisplay As Boolean)
MsgBox Description
End Sub