Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

Fast Search


SEP90: FAST SEARCH

File access in the fast lane

Leon is the senior managing engineer for a software developing firm specializing in medical practice management systems. Leon can be reached at Dental Plan Inc., 3633 Broadway, Garland, TX 75043.


This article presents FASTSRCH, a Basic function that scans a data file and returns an array containing the record numbers of all records that match a specified key. FASTSRCH applies a couple of basic concepts to attain very fast sequential access to data files: First, it reads and processes many records at a time using Basic's binary file I/O, yielding fewer Disk Read Requests to DOS. Secondly, it uses Basic's INSTR function to "instantly" scan for records meeting the desired criteria.

The idea is to scan the data file as fast as the disk drive can access it. This is not attained in the traditional record processing as there is a great deal of overhead that goes along with retrieving each record, comparing the key against a specific value, and iterating a loop counter.

We use FASTSRCH in almost every case where we used to scan files sequentially. The areas where you see the most dramatic gains are with smaller records. However, speed increases with almost any record length. This is especially true when incorporating Instr-ASM in lieu of Basic's INSTR. (See the accompanying text box, "Beyond FASTSRCH.")

On transactional files having record lengths less than about 100 bytes, the speed increase is amazing. For example, one application, which finds and displays all transactions for a certain customer, searches a summary transaction file that has a record length of 32 bytes. I benchmarked the old code on a large data set which took almost 1 minute, 20 seconds. Using FASTSRCH, the file was processed in 6 seconds!

Calling FASTSRCH

FASTSRCH is easy to call in a program. First, you dimension an array that will hold the found record numbers, then open the file to be searched, and then execute the call. When FASTSRCH returns control, the array has all matching record numbers loaded and is ready to go!

Figure 1 could be from a program that scans a transaction file and displays all records that belong to a particular customer. This example shows how it might be done in a traditional manner. In Figure 2, FASTSRCH is used for the scan. The first difference when using FASTSRCH is that we must set the maximum number of records we can retrieve in order to dimension the array that will receive the selected records' record numbers. The next difference is we open the file in binary access mode instead of random access mode. This allows FASTSRCH to process many records at once. But the greatest difference in the calling program's logic comes after the call to FASTSRCH. Here, we close and reopen the file in random access mode. Then, instead of looping through all records in the file, searching for matching ones, FASTSRCH has already identified the records we need, and hence, we can access the data directly. The seven parameters to pass to FASTSRCH are listed in Table 1; Table 2 lists the returned values.

Figure 1: Scanning a file without FASTSRCH

  OPEN "TRANSACT.DAT" FOR RANDOM AS #1 LEN = LEN(T)
  FOR I% = 1 TO NumRecs%
           GET #1, I%, T
           IF KeyVal$ = T. CustNum THEN CALL Print1Rec(T)
  NEXT

Figure 2: Scanning a file using FASTSRCH

  DIM FoundRecs% (MaxToFind%)
  OPEN "TRANSACT.DAT" FOR BINARY AS #1
  NumFound% = FASTSRCH (1, LEN(T), 7, KeyVal$, 1, LastRec%, FoundRecs%())
  CLOSE #1
  OPEN "TRANSACT.DAT" FOR RANDOM AS #1 LEN = LEN(T)
  FOR I% = 1 TO NumFound%
           GET #1, FoundRecs%(I%), T
           CALL Print1Rec(T)
  NEXT

Table 1: Parameters passed to FASTSRCH

  Parameter     Function

  BuffFile%     The file handle used
                in the OPEN statement.

  RecLen%       The fixed length of
                the data record.

  KeyLoc%       The starting byte of
                the field being
                matched on.

  KeyVal$       The value that the
                key needs to have
                to be a "match".

  StartRec%     The first record to
                begin searching on.

  LastRec%      The record number
                of the last number
                to search.

  FoundRecs%()  The array that will
                hold the record numbers
                of matching records.

Table 2: Values returned by FASTSRCH

  Parameter     Function

  FoundRecs%()  When this array
                is passed back, it
                will contain the
                matching record
                numbers.

  FASTSRCH      Returns the
                number of
                matches found.

Examining FASTSRCH

FASTSRCH (Listing One, page 90), which was compiled with Microsoft Basic 7.1, calculates and loads as many records as it can into a large file buffer. Then it searches each byte in the buffer for a match on the key. When it finds a match, it uses modular arithmetic (remainder after division) to find out if the matched data is in the key's location. If it is, it calculates the physical record number of the matched key's record and adds it to the array of matching records; otherwise, it continues to search the rest of the buffer. When the entire buffer is searched, it loads the next group of records into the buffer and starts over.

In detail, FASTSRCH starts out by creating a 10,000-byte file buffer in the far heap. My consultant, Bill McGill and I have found that a buffer larger than this doesn't seem to speed things up much. Next we perform some initialization.

Note that we check to see if the key that we will be matching on starts on the last byte of the logical record length (even though this would rarely be the case). If it does, we set the key location compare variable to zero instead of the key location. (Because we will later be using modulo arithmetic to verify that a match is at the key location.) If the key location is the last byte of the record, the remainder will be zero instead of the record length. We use integer division to calculate the number of whole records that can fit into our file buffer: This is the maximum amount of buffer that we will use.

The outer loop executes once for each buffer full of records we search. The first thing we do is calculate the byte in the file that is the first byte we will load into the buffer. Then, starting at that byte, we retrieve the data into the buffer. Next, if this is the last buffer and it is not full, we restrict the number of bytes to search.

The middle loop executes each time we find a matching record in this buffer. The innermost loop exists because Basic's INSTR will return a "false match" if there is a match not in the key's location. This loop, therefore, is needed to weed out the false matches. When INSTR finds a match, we check to make sure it isn't past our logical buffer length. If it is, or there just weren't any matches at all, we exit the loop. If there was a match then, the LOOP UNTIL makes sure that we are on the key position before exiting the loop and adding this record to our other found records. Otherwise, we loop again, continuing from the byte where we left off.

When we fall out of this inner loop, we check to see if there was a match, and if there is room for it in our array. If so, we calculate the physical record number and add it to the array.

This middle loop will continue until there are no more matches in this buffer. The next logical record number is calculated and we loop again if we are not past the end of the file. Finally, we set the number of records found to FASTSRCH%.

More Details.

Modifying FASTSRCH

The FASTSRCH function listed with this article is a bare-bone version to use for discussion. Once you understand the concept behind it, you can customize and enhance FASTSRCH to improve performance, flexibility, and presentation. Here are a few considerations:

  • If the file being searched can grow to have more than about 32,000 records, then a long integer array should be used, and FASTSRCH itself may need to return a long integer.
  • If records longer than about 100 bytes are to be searched, or if there will be many false matches (matches not in the key's location) then a smarter, more sophisticated version of INSTR can be used to enhance performance. We have written such a routine in assembler called InstrASM.
  • If you need more flexible searching capabilities, such as greater than or less than a specified value, you can use a slightly modified InstrASM.
  • FASTSRCH is very conducive to having a "percent finished" indicator displayed on the screen without slowing things down. Simply set up the window before calling FASTSRCH and add the following line as the first statement inside the outer DO loop: LOCATE Row%, Col%, 0: PRINT USING "###"; (CurRec%/ LastRec%) * 100;
  • If it is likely that the number of matches is not predictable then you can have FASTSRCH return "-1" to indicate the capacity of the dimensioned array was exceeded. Then the data can be processed and control could be passed back to FASTSRCH to process the rest of the file.
When it comes to accessing data files quickly without the hassle of maintaining sophisticated linked lists, or using database engines, FASTSRCH can be the answer to the "Searching, Please wait ..." blues.

Beyond FASTSRCH

FASTSRCH has a few limitations that can be overcome by using customized, low-level searching routines in place of INSTR. Performance gain can be lessened when searching long records. Because FASTSRCH uses Basic's INSTR to find a match, every byte in the buffer must be searched sequentially. When the records are short, there are not many extraneous comparisons. But when records are long, say 256 bytes, up to 255 extraneous compares are being made for each record. This problem becomes compounded when the value of the key being compared matches in other places in the records. These "false" matches can further degrade performance.

A solution to the problem is to create a smarter INSTR function that knows at which position in the records the key is located, and checks only this key data. For this to execute as quickly as possible, the core of the search must be written in assembler. We have written two routines: InstrR (see Listing Two, page 90) written in Basic and compiled with Microsoft Basic 7.1, is called by FASTSRCH, and InstrASM ( Listing Three, page 90), written in assembler and assembled with MASM 5.1, is called by InstrR. InstrR basically does the setup for InstrASM, which actually performs the search.

In order to implement InstrR, FASTSRCH needs to be simplified slightly. Because InstrR only returns matches on the key (no false matches) then the entire inner loop of FASTSRCH can be replaced with the lines in Figure 3(a). Also, the formula for calculating the matching record's actual record number must be simplified as shown in Figure 3(b).

Figure 3: Implementing InstrR

  (a)

  Match% = InstrR (Match% + 1, FarBuff( ), KeyVal$, RecLen%, KeyPos%)
  IF Match% > BuffLen% then Match% = 0

  (b)

  FoundRecs% (NumMatches%) = Match% + CurRec% - 1

Using Conditionals Other than "Equal to"

Another limitation of the FASTSRCH when using INSTR is its ability to select records based only on matching a key value exactly. However, when using InstrASM, we have control over which conditional we use to make selections. Instead of using JE (jump if equal), we can use JA (jump if above), or JB (jump if below) any other conditional. Changing the one jump statement is all it takes to make this modification, which adds a tremendous amount of flexibility to our searching capabilities.

-- L.C.

_FAST SEARCH_ by Leon Campise

[LISTING ONE]

<a name="01d6_0010">

FUNCTION FASTSRCH% (BuffFile%, RecLen%, KeyLoc%, KeyVal$, StartRec%,
                                                       LastRec%, FoundRecs%())

    CONST physbufflen% = 10000

    DIM FarBuff(Dum%) AS STRING * physbufflen  '* string buffer in the far heap
    MaxFound% = UBOUND(FoundRecs%)  '* can only find as many as array will hold
    NumMatches% = 0                 '* initialize the number of matches found

    IF KeyLoc% = RecLen% THEN         '* if key is the last byte in the record:
     CompByte% = 0 : LastByte% = -1   '* modulo position to check will be 0
    ELSE                              '* and when calculating record number
     CompByte%=KeyLoc% : LastByte%=0  '* of the matching record,
                                      '* 1 must be subtracted
    END IF

    RecsPerRead% = physbufflen \ RecLen%
                         '* calc. the # of records scanned with each disk read
    BuffLen% = RecsPerRead% * RecLen%
                         '* the actual number of bytes in the buffer being used
    CurRec% = StartRec%  '* initialize CurRec

    DO                   '* OUTER LOOP *  This loop executes for each disk read
       BuffByte& = ((CurRec% - 1&) * RecLen%) + 1  '* first byte in next buffer
       GET BuffFile%, BuffByte&, FarBuff(0)     '* perform the actual disk i/o
       Match% = 0                               '* initialize match to false
       IF CurRec% + RecsPerRead% - 1 > LastRec% THEN     '* when the last disk
           BuffLen% = (LastRec% - CurRec% + 1) * RecLen% '* read goes past the
       END IF                                            '* number of records
                                                         '* to search, shorten
                                                         '* the number of bytes
                                                         '* in buffer to scan

       DO         '* MIDDLE LOOP * Loops for each match found in this buffer
           DO     '* INNER LOOP * Loops for each "false" match
               Match% = INSTR(Match% + 1, FarBuff(0), KeyVal$)
               Match% > BuffLen% THEN Match% = 0
                             '* if match was past valid data, ignore the match
               IF Match% = 0 THEN EXIT DO     '* no more matches in this buffer
           LOOP WHILE Match% MOD RecLen% <> CompByte%
                       '* keep searching if the match was in the wrong position
           IF Match% > 0 AND NumMatches% < MaxFound% THEN
                                         '* if a match and the array isn't full
               NumMatches% = NumMatches% + 1
                                         '* add into the array of found records
               FoundRecs%(NumMatches%) = _
                  ((Match% + LastByte%) \ RecLen%) + CurRec%
                                       '* CALCULATE THE ACTUAL RECORD NUMBER *
           END IF
       LOOP UNTIL Match% = 0      '* no more matches in this buffer, exit loop
        CurRec% = CurRec% + RecsPerRead%  '* calc. first record of next buffer
    LOOP UNTIL CurRec% > LastRec%     '* if 1st rec. of next buffer is > number
                                      '* of recs to search then were finished
    FASTSRCH% = NumMatches%           '* set return value = number of matches

END FUNCTION




<a name="01d6_0011"><a name="01d6_0011">
<a name="01d6_0012">
[LISTING TWO]
<a name="01d6_0012">

DECLARE FUNCTION InstrASM% (BYVAL BuffSeg%, BYVAL BuffOff%, BYVAL BuffLen%,_
                 BYVAL KeySeg%, BYVAL KeyOff%, BYVAL KeyLen%, BYVAL RecLen%)

FUNCTION InstrR% (StRec%, Buff() AS STRING * 10000, KeyVal$, RecLen%,
                                                                KeyPos%) STATIC

    KeyLen% = LEN(KeyVal$)                         '* Get the length of the key

    StartingOffs% = ((StRec% - 1) * RecLen%) + KeyPos% - 1
                                            '* Calculate the 1st byte to search
    BuffLen% = LEN(Buff(0)) - StartingOffs%  '* Calculate # of bytes to search
    BuffSeg% = SSEG(Buff(0))                 '* Locate the buffer's segment
    BuffOff% = SADD(Buff(0)) + StartingOffs% '* and offset.

    KeySeg% = SSEG(KeyVal$)                   '* Locate the key's segment
    KeyOff% = SADD(KeyVal$)                   '* and offset.

    FoundPos% = InstrASM%(BuffSeg%, BuffOff%, BuffLen%, _
                       KeySeg%, KeyOff%, KeyLen%, RecLen%) '* Perform a search

    IF FoundPos% = 0 THEN              '* If no match,
        InstrR% = 0                    '* set the record number to 0
    ELSE                               '* otherwise calculate the record number
        InstrR% = (FoundPos% + StartingOffs% + RecLen% - 1) \ RecLen%
    END IF

END FUNCTION




<a name="01d6_0013"><a name="01d6_0013">
<a name="01d6_0014">
[LISTING THREE]
<a name="01d6_0014">

.MODEL MEDIUM, BASIC
.CODE
INSTRASM  PROC  FAR  USES DS SI DI, BuffSeg,BuffStartAdd,BuffLen,KeySeg,\
                                        KeyAddr,KeyLen,RecLen
        PUSHF
        MOV     BX,BuffSeg      ; Move the Buffer Segment
        MOV     ES, BX          ;   into the EXTRA SEGMENT REGISTER
        MOV     BX,KeySeg       ; Move the Key String Segment
        MOV     DS,BX           ;   into the DATA SEGMENT REGISTER
        MOV     AX,RecLen       ; Move the Record Length to AX
        MOV     DX,BuffLen      ; Move the Length of the Buffer to DX
        MOV     BX,BuffStartAdd ; Move Buffer Starting Address to BX
        ADD     DX,BX           ; Add Buffer Offset(BX) to Buffer Length(DX)
        SUB     DX,1            ;   and subtract 1 for End of Buffer Address
        CLD                     ; Clear the direction flag

FinishedYet:
        CMP     BX,DX           ; Next Record Address>End Of Buffer Address?
        JA      NoMoreBuff      ;   if so, quit.

        MOV     DI,BX           ; Move Next Record Address to Dest. Index
        MOV     SI,KeyAddr      ; Move Key Address to Source Index
        MOV     CX,KeyLen       ; Move Key Length to CX as REPE counter

        REPE CMPSB              ; Compare strings and if equal
        JE      Match           ;   go to Match (* can use JA or JB to alter *)

        ADD     BX,AX           ; Add Record Length(AX) to Next Record Address
        JMP     FinishedYet     ; Loop back and continue checking for a match

NoMoreBuff:
        XOR     AX,AX           ; Set AX to 0 for return in function value
        JMP     TheEnd          ; Go to end

Match:
        MOV     AX,BX           ; Move Current Record Offset to AX for return
        SUB     AX,BuffStartAdd ;   Subtract Starting Address of Buffer and
        ADD     AX,1            ;   add 1 to get actual offset w/i Buffer

TheEnd:
        POPF
        RET

INSTRASM ENDP
        END


[Example 1: Scanning a file without FASTSRCH]

OPEN "TRANSACT.DAT" FOR RANDOM AS #1 LEN = LEN(T)
FOR I% = 1 TO NumRecs%
         GET #1, I%, T
         IF KeyVal$ = T.CustNum THEN CALL Print1Rec(T)
NEXT


[Example 2: Scanning a file using FASTSRCH]

DIM FoundRecs%(MaxToFind%)
OPEN "TRANSACT.DAT" FOR BINARY AS #1
NumFound% = FASTSRCH(1, LEN(T), 7, KeyVal$, 1, LastRec%, FoundRecs%())
CLOSE #1
OPEN "TRANSACT.DAT" FOR RANDOM AS #1 LEN = LEN(T)
FOR I% = 1 TO NumFound%
         GET #1, FoundRecs%(I%), T
         CALL Print1Rec(T)
NEXT


[Example 3: Implementing InstrR]

(a)

Match% = InstrR(Match% + 1, FarBuff(), KeyVal$, RecLen%, KeyPos%)
IF Match% > BuffLen% then Match% = 0

(b)

FoundRecs%(NumMatches%) = Match% + CurRec% - 1










Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.