File access in the fast lane
Leon is the senior managing engineer for a software developing firm specializing in medical practice management systems. Leon can be reached at Dental Plan Inc., 3633 Broadway, Garland, TX 75043.
This article presents FASTSRCH, a Basic function that scans a data file and returns an array containing the record numbers of all records that match a specified key. FASTSRCH applies a couple of basic concepts to attain very fast sequential access to data files: First, it reads and processes many records at a time using Basic's binary file I/O, yielding fewer Disk Read Requests to DOS. Secondly, it uses Basic's INSTR function to "instantly" scan for records meeting the desired criteria.
The idea is to scan the data file as fast as the disk drive can access it. This is not attained in the traditional record processing as there is a great deal of overhead that goes along with retrieving each record, comparing the key against a specific value, and iterating a loop counter.
We use FASTSRCH in almost every case where we used to scan files sequentially. The areas where you see the most dramatic gains are with smaller records. However, speed increases with almost any record length. This is especially true when incorporating Instr-ASM in lieu of Basic's INSTR. (See the accompanying text box, "Beyond FASTSRCH.")
On transactional files having record lengths less than about 100 bytes, the speed increase is amazing. For example, one application, which finds and displays all transactions for a certain customer, searches a summary transaction file that has a record length of 32 bytes. I benchmarked the old code on a large data set which took almost 1 minute, 20 seconds. Using FASTSRCH, the file was processed in 6 seconds!
Calling FASTSRCH
FASTSRCH is easy to call in a program. First, you dimension an array that will hold the found record numbers, then open the file to be searched, and then execute the call. When FASTSRCH returns control, the array has all matching record numbers loaded and is ready to go!
Figure 1 could be from a program that scans a transaction file and displays all records that belong to a particular customer. This example shows how it might be done in a traditional manner. In Figure 2, FASTSRCH is used for the scan. The first difference when using FASTSRCH is that we must set the maximum number of records we can retrieve in order to dimension the array that will receive the selected records' record numbers. The next difference is we open the file in binary access mode instead of random access mode. This allows FASTSRCH to process many records at once. But the greatest difference in the calling program's logic comes after the call to FASTSRCH. Here, we close and reopen the file in random access mode. Then, instead of looping through all records in the file, searching for matching ones, FASTSRCH has already identified the records we need, and hence, we can access the data directly. The seven parameters to pass to FASTSRCH are listed in Table 1; Table 2 lists the returned values.
Figure 1: Scanning a file without FASTSRCH
OPEN "TRANSACT.DAT" FOR RANDOM AS #1 LEN = LEN(T) FOR I% = 1 TO NumRecs% GET #1, I%, T IF KeyVal$ = T. CustNum THEN CALL Print1Rec(T) NEXT
Figure 2: Scanning a file using FASTSRCH
DIM FoundRecs% (MaxToFind%) OPEN "TRANSACT.DAT" FOR BINARY AS #1 NumFound% = FASTSRCH (1, LEN(T), 7, KeyVal$, 1, LastRec%, FoundRecs%()) CLOSE #1 OPEN "TRANSACT.DAT" FOR RANDOM AS #1 LEN = LEN(T) FOR I% = 1 TO NumFound% GET #1, FoundRecs%(I%), T CALL Print1Rec(T) NEXT
Table 1: Parameters passed to FASTSRCH
Parameter Function BuffFile% The file handle used in the OPEN statement. RecLen% The fixed length of the data record. KeyLoc% The starting byte of the field being matched on. KeyVal$ The value that the key needs to have to be a "match". StartRec% The first record to begin searching on. LastRec% The record number of the last number to search. FoundRecs%() The array that will hold the record numbers of matching records.
Table 2: Values returned by FASTSRCH
Parameter Function FoundRecs%() When this array is passed back, it will contain the matching record numbers. FASTSRCH Returns the number of matches found.
Examining FASTSRCH
FASTSRCH (Listing One, page 90), which was compiled with Microsoft Basic 7.1, calculates and loads as many records as it can into a large file buffer. Then it searches each byte in the buffer for a match on the key. When it finds a match, it uses modular arithmetic (remainder after division) to find out if the matched data is in the key's location. If it is, it calculates the physical record number of the matched key's record and adds it to the array of matching records; otherwise, it continues to search the rest of the buffer. When the entire buffer is searched, it loads the next group of records into the buffer and starts over.
In detail, FASTSRCH starts out by creating a 10,000-byte file buffer in the far heap. My consultant, Bill McGill and I have found that a buffer larger than this doesn't seem to speed things up much. Next we perform some initialization.
Note that we check to see if the key that we will be matching on starts on the last byte of the logical record length (even though this would rarely be the case). If it does, we set the key location compare variable to zero instead of the key location. (Because we will later be using modulo arithmetic to verify that a match is at the key location.) If the key location is the last byte of the record, the remainder will be zero instead of the record length. We use integer division to calculate the number of whole records that can fit into our file buffer: This is the maximum amount of buffer that we will use.
The outer loop executes once for each buffer full of records we search. The first thing we do is calculate the byte in the file that is the first byte we will load into the buffer. Then, starting at that byte, we retrieve the data into the buffer. Next, if this is the last buffer and it is not full, we restrict the number of bytes to search.
The middle loop executes each time we find a matching record in this buffer. The innermost loop exists because Basic's INSTR will return a "false match" if there is a match not in the key's location. This loop, therefore, is needed to weed out the false matches. When INSTR finds a match, we check to make sure it isn't past our logical buffer length. If it is, or there just weren't any matches at all, we exit the loop. If there was a match then, the LOOP UNTIL makes sure that we are on the key position before exiting the loop and adding this record to our other found records. Otherwise, we loop again, continuing from the byte where we left off.
When we fall out of this inner loop, we check to see if there was a match, and if there is room for it in our array. If so, we calculate the physical record number and add it to the array.
This middle loop will continue until there are no more matches in this buffer. The next logical record number is calculated and we loop again if we are not past the end of the file. Finally, we set the number of records found to FASTSRCH%.
Modifying FASTSRCH
The FASTSRCH function listed with this article is a bare-bone version to use for discussion. Once you understand the concept behind it, you can customize and enhance FASTSRCH to improve performance, flexibility, and presentation. Here are a few considerations:
- If the file being searched can grow to have more than about 32,000 records, then a long integer array should be used, and FASTSRCH itself may need to return a long integer.
- If records longer than about 100 bytes are to be searched, or if there will be many false matches (matches not in the key's location) then a smarter, more sophisticated version of INSTR can be used to enhance performance. We have written such a routine in assembler called InstrASM.
- If you need more flexible searching capabilities, such as greater than or less than a specified value, you can use a slightly modified InstrASM.
- FASTSRCH is very conducive to having a "percent finished" indicator displayed on the screen without slowing things down. Simply set up the window before calling FASTSRCH and add the following line as the first statement inside the outer DO loop: LOCATE Row%, Col%, 0: PRINT USING "###"; (CurRec%/ LastRec%) * 100;
- If it is likely that the number of matches is not predictable then you can have FASTSRCH return "-1" to indicate the capacity of the dimensioned array was exceeded. Then the data can be processed and control could be passed back to FASTSRCH to process the rest of the file.
Beyond FASTSRCH
FASTSRCH has a few limitations that can be overcome by using customized, low-level searching routines in place of INSTR. Performance gain can be lessened when searching long records. Because FASTSRCH uses Basic's INSTR to find a match, every byte in the buffer must be searched sequentially. When the records are short, there are not many extraneous comparisons. But when records are long, say 256 bytes, up to 255 extraneous compares are being made for each record. This problem becomes compounded when the value of the key being compared matches in other places in the records. These "false" matches can further degrade performance.
A solution to the problem is to create a smarter INSTR function that knows at which position in the records the key is located, and checks only this key data. For this to execute as quickly as possible, the core of the search must be written in assembler. We have written two routines: InstrR (see Listing Two, page 90) written in Basic and compiled with Microsoft Basic 7.1, is called by FASTSRCH, and InstrASM ( Listing Three, page 90), written in assembler and assembled with MASM 5.1, is called by InstrR. InstrR basically does the setup for InstrASM, which actually performs the search.
In order to implement InstrR, FASTSRCH needs to be simplified slightly. Because InstrR only returns matches on the key (no false matches) then the entire inner loop of FASTSRCH can be replaced with the lines in Figure 3(a). Also, the formula for calculating the matching record's actual record number must be simplified as shown in Figure 3(b).
Figure 3: Implementing InstrR
(a) Match% = InstrR (Match% + 1, FarBuff( ), KeyVal$, RecLen%, KeyPos%) IF Match% > BuffLen% then Match% = 0 (b) FoundRecs% (NumMatches%) = Match% + CurRec% - 1
Using Conditionals Other than "Equal to"
Another limitation of the FASTSRCH when using INSTR is its ability to select records based only on matching a key value exactly. However, when using InstrASM, we have control over which conditional we use to make selections. Instead of using JE (jump if equal), we can use JA (jump if above), or JB (jump if below) any other conditional. Changing the one jump statement is all it takes to make this modification, which adds a tremendous amount of flexibility to our searching capabilities.
-- L.C.
_FAST SEARCH_ by Leon Campise
[LISTING ONE]<a name="01d6_0010">
FUNCTION FASTSRCH% (BuffFile%, RecLen%, KeyLoc%, KeyVal$, StartRec%,
LastRec%, FoundRecs%())
CONST physbufflen% = 10000
DIM FarBuff(Dum%) AS STRING * physbufflen '* string buffer in the far heap
MaxFound% = UBOUND(FoundRecs%) '* can only find as many as array will hold
NumMatches% = 0 '* initialize the number of matches found
IF KeyLoc% = RecLen% THEN '* if key is the last byte in the record:
CompByte% = 0 : LastByte% = -1 '* modulo position to check will be 0
ELSE '* and when calculating record number
CompByte%=KeyLoc% : LastByte%=0 '* of the matching record,
'* 1 must be subtracted
END IF
RecsPerRead% = physbufflen \ RecLen%
'* calc. the # of records scanned with each disk read
BuffLen% = RecsPerRead% * RecLen%
'* the actual number of bytes in the buffer being used
CurRec% = StartRec% '* initialize CurRec
DO '* OUTER LOOP * This loop executes for each disk read
BuffByte& = ((CurRec% - 1&) * RecLen%) + 1 '* first byte in next buffer
GET BuffFile%, BuffByte&, FarBuff(0) '* perform the actual disk i/o
Match% = 0 '* initialize match to false
IF CurRec% + RecsPerRead% - 1 > LastRec% THEN '* when the last disk
BuffLen% = (LastRec% - CurRec% + 1) * RecLen% '* read goes past the
END IF '* number of records
'* to search, shorten
'* the number of bytes
'* in buffer to scan
DO '* MIDDLE LOOP * Loops for each match found in this buffer
DO '* INNER LOOP * Loops for each "false" match
Match% = INSTR(Match% + 1, FarBuff(0), KeyVal$)
Match% > BuffLen% THEN Match% = 0
'* if match was past valid data, ignore the match
IF Match% = 0 THEN EXIT DO '* no more matches in this buffer
LOOP WHILE Match% MOD RecLen% <> CompByte%
'* keep searching if the match was in the wrong position
IF Match% > 0 AND NumMatches% < MaxFound% THEN
'* if a match and the array isn't full
NumMatches% = NumMatches% + 1
'* add into the array of found records
FoundRecs%(NumMatches%) = _
((Match% + LastByte%) \ RecLen%) + CurRec%
'* CALCULATE THE ACTUAL RECORD NUMBER *
END IF
LOOP UNTIL Match% = 0 '* no more matches in this buffer, exit loop
CurRec% = CurRec% + RecsPerRead% '* calc. first record of next buffer
LOOP UNTIL CurRec% > LastRec% '* if 1st rec. of next buffer is > number
'* of recs to search then were finished
FASTSRCH% = NumMatches% '* set return value = number of matches
END FUNCTION
<a name="01d6_0011"><a name="01d6_0011">
<a name="01d6_0012">
[LISTING TWO]<a name="01d6_0012">
DECLARE FUNCTION InstrASM% (BYVAL BuffSeg%, BYVAL BuffOff%, BYVAL BuffLen%,_
BYVAL KeySeg%, BYVAL KeyOff%, BYVAL KeyLen%, BYVAL RecLen%)
FUNCTION InstrR% (StRec%, Buff() AS STRING * 10000, KeyVal$, RecLen%,
KeyPos%) STATIC
KeyLen% = LEN(KeyVal$) '* Get the length of the key
StartingOffs% = ((StRec% - 1) * RecLen%) + KeyPos% - 1
'* Calculate the 1st byte to search
BuffLen% = LEN(Buff(0)) - StartingOffs% '* Calculate # of bytes to search
BuffSeg% = SSEG(Buff(0)) '* Locate the buffer's segment
BuffOff% = SADD(Buff(0)) + StartingOffs% '* and offset.
KeySeg% = SSEG(KeyVal$) '* Locate the key's segment
KeyOff% = SADD(KeyVal$) '* and offset.
FoundPos% = InstrASM%(BuffSeg%, BuffOff%, BuffLen%, _
KeySeg%, KeyOff%, KeyLen%, RecLen%) '* Perform a search
IF FoundPos% = 0 THEN '* If no match,
InstrR% = 0 '* set the record number to 0
ELSE '* otherwise calculate the record number
InstrR% = (FoundPos% + StartingOffs% + RecLen% - 1) \ RecLen%
END IF
END FUNCTION
<a name="01d6_0013"><a name="01d6_0013">
<a name="01d6_0014">
[LISTING THREE]<a name="01d6_0014">
.MODEL MEDIUM, BASIC
.CODE
INSTRASM PROC FAR USES DS SI DI, BuffSeg,BuffStartAdd,BuffLen,KeySeg,\
KeyAddr,KeyLen,RecLen
PUSHF
MOV BX,BuffSeg ; Move the Buffer Segment
MOV ES, BX ; into the EXTRA SEGMENT REGISTER
MOV BX,KeySeg ; Move the Key String Segment
MOV DS,BX ; into the DATA SEGMENT REGISTER
MOV AX,RecLen ; Move the Record Length to AX
MOV DX,BuffLen ; Move the Length of the Buffer to DX
MOV BX,BuffStartAdd ; Move Buffer Starting Address to BX
ADD DX,BX ; Add Buffer Offset(BX) to Buffer Length(DX)
SUB DX,1 ; and subtract 1 for End of Buffer Address
CLD ; Clear the direction flag
FinishedYet:
CMP BX,DX ; Next Record Address>End Of Buffer Address?
JA NoMoreBuff ; if so, quit.
MOV DI,BX ; Move Next Record Address to Dest. Index
MOV SI,KeyAddr ; Move Key Address to Source Index
MOV CX,KeyLen ; Move Key Length to CX as REPE counter
REPE CMPSB ; Compare strings and if equal
JE Match ; go to Match (* can use JA or JB to alter *)
ADD BX,AX ; Add Record Length(AX) to Next Record Address
JMP FinishedYet ; Loop back and continue checking for a match
NoMoreBuff:
XOR AX,AX ; Set AX to 0 for return in function value
JMP TheEnd ; Go to end
Match:
MOV AX,BX ; Move Current Record Offset to AX for return
SUB AX,BuffStartAdd ; Subtract Starting Address of Buffer and
ADD AX,1 ; add 1 to get actual offset w/i Buffer
TheEnd:
POPF
RET
INSTRASM ENDP
END
[Example 1: Scanning a file without FASTSRCH]
OPEN "TRANSACT.DAT" FOR RANDOM AS #1 LEN = LEN(T)
FOR I% = 1 TO NumRecs%
GET #1, I%, T
IF KeyVal$ = T.CustNum THEN CALL Print1Rec(T)
NEXT
[Example 2: Scanning a file using FASTSRCH]
DIM FoundRecs%(MaxToFind%)
OPEN "TRANSACT.DAT" FOR BINARY AS #1
NumFound% = FASTSRCH(1, LEN(T), 7, KeyVal$, 1, LastRec%, FoundRecs%())
CLOSE #1
OPEN "TRANSACT.DAT" FOR RANDOM AS #1 LEN = LEN(T)
FOR I% = 1 TO NumFound%
GET #1, FoundRecs%(I%), T
CALL Print1Rec(T)
NEXT
[Example 3: Implementing InstrR]
(a)
Match% = InstrR(Match% + 1, FarBuff(), KeyVal$, RecLen%, KeyPos%)
IF Match% > BuffLen% then Match% = 0
(b)
FoundRecs%(NumMatches%) = Match% + CurRec% - 1