[VB6][WEB] Parsing HTML


Description
What is parsing?
Parsing is another term for "interpreting". In most cases, it means extracting information from a string. For example, you may want to extract certain parts of the HTML source of a webpage. When parsing, you will use VB's basic string manipulation functions: Left, Mid, Right, Trim, InStr, InStrRev, Split, etc. Most of the time, a combination of InStr and Mid is all you need (avoid using the Split function if possible).
It's hard to teach "parsing" because it depends entirely on the data you're working with. However, after some practice, you will start to notice patterns and realize that most parsing situations call for almost the same thing.
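
As a quick refresher on the functions mentioned above, here is a minimal sketch. The sub name DemoBasics and the sample string are made up for illustration; only InStr, Mid$ and Left$ are built-in VB functions.

```vb
Private Sub DemoBasics()
    ' Illustrative only: strSample is a made-up string, not from a real page.
    Dim strSample As String
    Dim lonAt As Long

    strSample = "Contact: User@Hotmail.com"

    ' InStr returns the 1-based position of the first match, or 0 if not found.
    lonAt = InStr(1, strSample, "@", vbBinaryCompare)
    Debug.Print lonAt                       ' 14

    ' Mid$ without a length returns everything from that position onward.
    Debug.Print Mid$(strSample, lonAt + 1)  ' Hotmail.com

    ' Left$ returns the first N characters.
    Debug.Print Left$(strSample, 7)         ' Contact
End Sub
```

Note that positions in VB strings start at 1, not 0, which is why a return value of 0 from InStr can safely mean "not found".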


AUTHOR
Copyright (c) 2007 - Danny Elkins (DigiRev)
http://www.DannyDotGuitar.com/digirev/
DigiRev@Hotmail.com



How To
Scenario 1: HTML

Parsing HTML
In this scenario, we will be parsing some HTML returned from a webpage. Let's say we wanted to parse, or extract, the e-mail address from a webpage. The HTML looks like this:

<html>
  <head>
    <title>Parse this</title>
  </head>
  <body>
    <strong>User@Hotmail.com</strong>
  </body>
</html>

That is probably the most basic HTML you will ever come across. It is just to get a basic idea of how a common parsing routine works. It doesn't matter how "complex" or confusing the HTML looks. The parsing process will be the same.

Step 1: Find the start!
The first step in any extraction routine is to find the start. What is the start? It is a constant value that comes immediately before what we're looking for. Most HTML isn't constant, so it's best to keep your routine as loose as possible while still being reliable. It's also common to have to modify or rewrite parsing routines later as the data changes.

Anyway, what is the start that we're looking for here? The first thing that comes before the e-mail address. In this case, it would be <strong>.

WAIT!

An important thing to consider when looking for the start is: does that string appear anywhere else in the HTML? Does it come before or after? For example, what if the HTML looked like this?

<html>
  <head>
    <title>Parse this</title>
  </head>
  <body>
    <strong>Welcome!</strong>
    <strong>User@Hotmail.com</strong>
  </body>
</html>

What would be the start there? You would have to find the 2nd instance of <strong>. What if the HTML looked like this?

<html>
  <head>
    <title>Parse this</title>
  </head>
  <body>
    <strong>Welcome!</strong>
    <a href="#"><strong>User@Hotmail.com</strong></a>
  </body>
</html>

What would we use as the start there? We wouldn't use just <strong>, obviously, because that would bring us to "Welcome!". We would use <a href="#"><strong>. That string only appears once. So, always look for the most constant, unique string to use as the start.
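
When no unique start string exists, as in the earlier example with two plain <strong> tags, one option is to search repeatedly until you reach the instance you want. The helper below is a sketch, not part of the original routine; the name FindNth and its parameters are hypothetical.

```vb
' Hypothetical helper: returns the position of the Nth occurrence
' of Find in Data, or 0 if there are fewer than N occurrences.
Private Function FindNth(ByVal Data As String, ByVal Find As String, _
    ByVal N As Long) As Long

    Dim lonPos As Long, i As Long

    lonPos = 0
    For i = 1 To N
        ' Resume the search one character past the previous match.
        lonPos = InStr(lonPos + 1, Data, Find, vbTextCompare)
        If lonPos = 0 Then Exit For
    Next i

    FindNth = lonPos
End Function
```

For example, FindNth("<strong>Welcome!</strong><strong>User@Hotmail.com</strong>", "<strong>", 2) returns 26, the position of the second <strong>. This is more fragile than a unique start string, because it breaks as soon as the page adds or removes a <strong> tag before the one you want.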

Finding the start
So, how do we find the start in the HTML? VB has a nice built-in function called InStr. I am assuming you have some experience with the basic string manipulation functions, but if not, read the comments (and then go learn how to use it. :) ).

Dim lonPos As Long
Dim strStart As String

'The start string.
strStart = "<a href=""#""><strong>"

'Find the start string.
lonPos = InStr(1, HTML, strStart, vbTextCompare)

'1 - Where we start searching in the string (from the beginning).
'HTML - The string holding the HTML.
'strStart - What we're searching for.
'vbTextCompare - Case-insensitive search (more reliable for HTML).
'vbBinaryCompare - Faster, but it's case-sensitive.

If the search string was found, lonPos will contain the starting position. The starting position would be the < in <a href="#".
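
To see why the compare method matters for HTML, here is a small sketch where the page happens to use uppercase tags. The sub name DemoCompareModes and the sample HTML are made up for illustration.

```vb
Private Sub DemoCompareModes()
    Dim strHtml As String

    ' Made-up HTML that uses uppercase tags.
    strHtml = "<BODY><STRONG>User@Hotmail.com</STRONG></BODY>"

    ' Case-sensitive: "<strong>" does not match "<STRONG>", so InStr returns 0.
    Debug.Print InStr(1, strHtml, "<strong>", vbBinaryCompare)   ' 0

    ' Case-insensitive: the tag is found right after "<BODY>" (6 characters).
    Debug.Print InStr(1, strHtml, "<strong>", vbTextCompare)     ' 7
End Sub
```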

Step 2: Find the end!
Yup, now that we found the start, we find the end. It's pretty simple. The end would be what comes immediately after what we're looking for. In this example, it would be </strong>. So, all we do is use the InStr function again. Except, this time, we will supply the function with lonPos and have it start searching from there. If we searched from the beginning, it would take us to the end of "Welcome!".

Dim lonPos As Long, lonEnd As Long
Dim strStart As String, strEnd As String
Dim strEmail As String

'The start and end strings.
strStart = "<a href=""#""><strong>"
strEnd = "</strong>"

'Find the start string.
lonPos = InStr(1, HTML, strStart, vbTextCompare)

If lonPos > 0 Then
    'Move to the end of the start string
    'which happens to be the beginning of what we're looking for. :)
    lonPos = lonPos + Len(strStart)
   
    'Find the end string starting from where we found the start.
    lonEnd = InStr(lonPos, HTML, strEnd, vbTextCompare)
   
    If lonEnd > 0 Then
        'Now, we have the starting and ending position.
        'What we do is extract the information between them.
       
        'The length of data (e-mail address) will be:
        'lonEnd - lonPos
        strEmail = Mid$(HTML, lonPos, lonEnd - lonPos)
       
        'Done!
        MsgBox strEmail
    End If
End If

A little explanation:

If lonPos > 0 Then
Checks whether we found the start. If InStr didn't find it, it returns 0.

    lonPos = lonPos + Len(strStart)
    This moves us from the beginning of the start string (X<a href="#"><strong>) to the end of the start string (<a href="#"><strong>X).
    Right after the end of the start string is what we're looking for (the e-mail address).

    lonEnd = InStr(lonPos, HTML, strEnd, vbTextCompare)
    The search starts from lonPos and finds the first strEnd (</strong>) after it.

    If lonEnd > 0 Then
    If InStr found the end string (</strong>), then...

        strEmail = Mid$(HTML, lonPos, lonEnd - lonPos)
        We use Mid$ to extract something from the middle of the string.
        We start at lonPos, which is the first character of the e-mail address.
        The third argument is a length, not an end position: lonEnd - lonPos equals the length of the e-mail address, whatever that length is.
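
To make the arithmetic concrete, here is a sketch that prints each intermediate value for a one-line version of the HTML. The sub name DemoTrace and the shortened HTML string are assumptions for illustration; the positions shown apply only to this exact string.

```vb
Private Sub DemoTrace()
    Dim strHtml As String, strStart As String
    Dim lonPos As Long, lonEnd As Long

    ' Shortened, made-up HTML so the positions are easy to count.
    strHtml = "<a href=""#""><strong>User@Hotmail.com</strong></a>"
    strStart = "<a href=""#""><strong>"

    lonPos = InStr(1, strHtml, strStart, vbTextCompare)
    Debug.Print lonPos              ' 1  (start string begins the HTML)

    lonPos = lonPos + Len(strStart)
    Debug.Print lonPos              ' 21 (start string is 20 characters long)

    lonEnd = InStr(lonPos, strHtml, "</strong>", vbTextCompare)
    Debug.Print lonEnd              ' 37 (position of the < in </strong>)

    Debug.Print lonEnd - lonPos     ' 16 (length of the e-mail address)
    Debug.Print Mid$(strHtml, lonPos, lonEnd - lonPos)   ' User@Hotmail.com
End Sub
```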

Done
As you can see, the entire process of that parsing routine was:
          Find the start (InStr)
          Find the end (InStr)
          Extract the data between (Mid)

And you know what? That is the exact same process you will use 90% of the time when you want to extract data from between two other strings.

Try this: Change up the HTML. Change the e-mail address. Change the values of strStart and strEnd in the code to match those of the HTML. Run the code. It will work regardless (as long as you got the Start and End strings right).
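
As one possible variation, the sketch below pulls the page title out of the same kind of HTML just by swapping the start and end strings. The sub name DemoTitle is hypothetical, and the HTML is the sample page collapsed onto one line.

```vb
Private Sub DemoTitle()
    Dim strHtml As String, strStart As String, strEnd As String
    Dim lonPos As Long, lonEnd As Long

    ' The sample page from this tutorial, without line breaks.
    strHtml = "<html><head><title>Parse this</title></head>" & _
              "<body><strong>User@Hotmail.com</strong></body></html>"

    ' Only these two values change; the routine itself is identical.
    strStart = "<title>"
    strEnd = "</title>"

    lonPos = InStr(1, strHtml, strStart, vbTextCompare)
    If lonPos > 0 Then
        lonPos = lonPos + Len(strStart)
        lonEnd = InStr(lonPos, strHtml, strEnd, vbTextCompare)
        If lonEnd > 0 Then
            Debug.Print Mid$(strHtml, lonPos, lonEnd - lonPos)   ' Parse this
        End If
    End If
End Sub
```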

Since most of your parsing routines will use this method, it might be a good idea to wrap up all the code in a re-usable function.

Wrapping it up
Wrapping up the above code into a reusable function:

Private Function GetBetween(ByVal Start As Long, Data As String, _
    StartString As String, EndString As String, _
    Optional ByVal CompareMethod As VbCompareMethod = vbBinaryCompare) As String
   
    Dim lonStart As Long, lonEnd As Long
   
    '1. Find start string.
    lonStart = InStr(Start, Data, StartString, CompareMethod)
   
    If lonStart > 0 Then
        '2. Move to end of start string.
        lonStart = lonStart + Len(StartString)

        '3. Find end string.
        lonEnd = InStr(lonStart, Data, EndString, CompareMethod)
       
        If lonEnd > 0 Then
            '4. Extract data between start and end strings.
            GetBetween = Mid$(Data, lonStart, lonEnd - lonStart)
        End If
    End If
   
End Function

And if we were to use this function for this scenario, it would be:

strEmail = GetBetween(1, HTML, "<a href=""#""><strong>", "</strong>", vbTextCompare)
MsgBox strEmail

1 - Where we start searching in the string (beginning).
HTML - The HTML we are working with.
<a href="#... - The start string.
</strong> - The end string.
vbTextCompare - Case-insensitive search (slower, but more reliable for HTML).
vbBinaryCompare - Case-sensitive search (faster, but more strict).
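
If a page contains several matches, the same find-start, find-end, extract steps can be repeated in a loop, resuming after each end string. The function below is a sketch of that idea, not part of the original tutorial; the name GetAllBetween and its use of a Collection for the results are assumptions.

```vb
' Hypothetical companion to GetBetween: returns every piece of text
' found between StartString and EndString, in order of appearance.
Private Function GetAllBetween(ByVal Data As String, _
    ByVal StartString As String, ByVal EndString As String, _
    Optional ByVal CompareMethod As VbCompareMethod = vbBinaryCompare) As Collection

    Dim colResults As Collection
    Dim lonStart As Long, lonEnd As Long

    Set colResults = New Collection

    ' Skip the search entirely if either marker is empty.
    If Len(StartString) > 0 And Len(EndString) > 0 Then
        lonStart = InStr(1, Data, StartString, CompareMethod)

        Do While lonStart > 0
            ' Move to the end of the start string.
            lonStart = lonStart + Len(StartString)

            ' Find the end string; stop if this match is never closed.
            lonEnd = InStr(lonStart, Data, EndString, CompareMethod)
            If lonEnd = 0 Then Exit Do

            colResults.Add Mid$(Data, lonStart, lonEnd - lonStart)

            ' Look for the next start string after this end string.
            lonStart = InStr(lonEnd + Len(EndString), Data, StartString, CompareMethod)
        Loop
    End If

    Set GetAllBetween = colResults
End Function
```

Called with the two-<strong> HTML from earlier, "<strong>" and "</strong>" as the markers, and vbTextCompare, this would return a Collection holding "Welcome!" and "User@Hotmail.com". You could then pick the item you need by index, or loop through all of them with For Each.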

I hope this short little tutorial helps someone. Feel free to contact me if you have any questions/comments or suggestions on something to add to this tutorial, or suggestions for a new tutorial.

Remember the steps: Find the start, find the end, extract between.
