Description
What is parsing?
Parsing is another term used for "interpreting". In most
cases, it means extracting information from a string. For example, you may want
to extract certain parts from the HTML source of a webpage. When parsing, you
will be using VB's basic string manipulation functions: Left, Mid, Right, Trim, InStr, InStrRev, Split, etc. In most cases, it's just a combination of InStr and Mid (avoid using the Split function if possible).
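As a quick refresher, here is a minimal sketch of what some of those functions return. The sample string is made up for illustration, and the output goes to the Immediate window via Debug.Print:

```vb
Dim strText As String

'Sample data (an assumption for this sketch, with padding to show Trim).
strText = "  User@Hotmail.com  "

strText = Trim$(strText)            'Strips the leading and trailing spaces
Debug.Print Left$(strText, 4)       'User
Debug.Print Right$(strText, 3)      'com
Debug.Print Mid$(strText, 6, 7)     'Hotmail
Debug.Print InStr(1, strText, "@")  '5 (position of the first @)
Debug.Print InStrRev(strText, ".")  '13 (position of the last dot)
```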
It's really hard to teach "parsing" to someone because it's
entirely dependent on what data you're working with. However, after some
practice, you will start to notice patterns and realize that most parsing
situations call for almost the same thing.
AUTHOR
Copyright (c) 2007 - Danny Elkins (DigiRev)
http://www.DannyDotGuitar.com/digirev/
DigiRev@Hotmail.com
How To
Scenario 1: HTML
Parsing HTML
In this scenario, we will be parsing some HTML returned from a
webpage. Let's say we wanted to parse, or extract, the e-mail address from a
webpage. The HTML looks like this:
<html>
<head>
<title>Parse this</title>
</head>
<body>
<strong>User@Hotmail.com</strong>
</body>
</html>
That is probably the most basic HTML you will ever come across. It is
just to get a basic idea of how a common parsing routine works. It doesn't
matter how "complex" or confusing the HTML looks. The parsing process
will be the same.
Step 1: Find the start!
The first step in any extraction routine is to find the start. What
is the start? It is a constant value that comes immediately before what we're
looking for. Most HTML isn't constant, so it's best to keep your routine
as loose, yet as reliable, as possible. It's also common to have to modify or
rewrite some parsing routines in the future as the data changes.
Anyway, what is the start that we're looking for here? The first thing
that comes before the e-mail address. In this case, it would be <strong>.
WAIT!
An important thing to consider when looking for the start is: Does
that data appear anywhere else in the HTML? Does it come before or after? For
example, what if the HTML looked like this?
<html>
<head>
<title>Parse this</title>
</head>
<body>
<strong>Welcome!</strong>
<strong>User@Hotmail.com</strong>
</body>
</html>
What would be the start there? You would have to find the 2nd instance of
<strong>. What if the HTML looked like this?
<html>
<head>
<title>Parse this</title>
</head>
<body>
<strong>Welcome!</strong>
<a href="#"><strong>User@Hotmail.com</strong></a>
</body>
</html>
What would we use as the start there? We wouldn't use just <strong>, obviously, because that would bring us to "Welcome!".
We would use <a href="#"><strong>. That string only appears once. So, always find the most constant,
unique string to use as the start.
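For the earlier case, where the address sits in the 2nd <strong> and there is no unique string to search for, one option is to count occurrences. This is only a sketch: InStrNth is a hypothetical helper name, not a built-in VB function.

```vb
'Hypothetical helper (not part of VB): returns the position of the Nth
'occurrence of Find in Data, or 0 if there are fewer than Nth occurrences.
Private Function InStrNth(ByRef Data As String, ByRef Find As String, _
        ByVal Nth As Long, _
        Optional ByVal CompareMethod As VbCompareMethod = vbTextCompare) As Long
    Dim lonPos As Long
    Dim lonCount As Long

    'Nothing sensible to search for.
    If Len(Find) = 0 Or Nth < 1 Then Exit Function

    lonPos = InStr(1, Data, Find, CompareMethod)
    Do While lonPos > 0
        lonCount = lonCount + 1
        If lonCount = Nth Then
            InStrNth = lonPos
            Exit Function
        End If
        'Continue searching just past the match we found.
        lonPos = InStr(lonPos + Len(Find), Data, Find, CompareMethod)
    Loop
End Function
```

With the second HTML sample, lonPos = InStrNth(HTML, "<strong>", 2) would point at the <strong> that comes right before the e-mail address. A unique start string is still the more reliable choice whenever one exists, because counting breaks as soon as the page adds or removes a tag.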
Finding the start
So, how do we find the start in the HTML? VB has a nice built-in
function called InStr. I'm assuming you have some experience with basic
string manipulation functions, but if not, read the comments, and then go
learn how to use it. :)
Dim lonPos As Long
Dim strStart As String
'The start string.
strStart = "<a href=""#""><strong>"
'Find the start string.
lonPos = InStr(1, HTML, strStart, vbTextCompare)
'1 - Where we start searching in the string (from the beginning).
'HTML - The string holding the HTML.
'strStart - What we're searching for.
'vbTextCompare - Case-insensitive search (more reliable for HTML).
'vbBinaryCompare - Faster, but it's case-sensitive.
If the search string was found, lonPos will contain the
starting position. The starting position would be the < in <a href="#".
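To see why the compare mode matters, here is a small sketch. The uppercase tag string is made up for illustration and is not taken from the page above:

```vb
Dim strHtml As String

'Sample data (an assumption): the same tag, written in uppercase.
strHtml = "<STRONG>Hi</STRONG>"

'Case-sensitive: "<strong>" does not match "<STRONG>", so InStr returns 0.
Debug.Print InStr(1, strHtml, "<strong>", vbBinaryCompare)

'Case-insensitive: the tag is found at position 1.
Debug.Print InStr(1, strHtml, "<strong>", vbTextCompare)
```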
Step 2: Find the end!
Yup, now that we found the start, we find the end. It's pretty simple.
The end would be what comes immediately after what we're looking for. In this
example, it would be </strong>. So, all we do is use the
InStr function again. Except, this time, we will supply the function
with lonPos and have it start searching from there. If we searched from
the beginning, it would take us to the end of "Welcome!".
Dim lonPos As Long, lonEnd As Long
Dim strStart As String, strEnd As String
Dim strEmail As String
'The start and end strings.
strStart = "<a href=""#""><strong>"
strEnd = "</strong>"
'Find the start string.
lonPos = InStr(1, HTML, strStart, vbTextCompare)
If lonPos > 0 Then
    'Move to the end of the start string
    'which happens to be the beginning of what we're looking for. :)
    lonPos = lonPos + Len(strStart)
    'Find the end string starting from where we found the start.
    lonEnd = InStr(lonPos, HTML, strEnd, vbTextCompare)
    If lonEnd > 0 Then
        'Now, we have the starting and ending position.
        'What we do is extract the information between them.
        'The length of data (e-mail address) will be:
        'lonEnd - lonPos
        strEmail = Mid$(HTML, lonPos, lonEnd - lonPos)
        'Done!
        MsgBox strEmail
    End If
End If
A little explanation:
If lonPos > 0 Then
Checks if we found the start. If InStr didn't find it, it will return 0.
lonPos = lonPos + Len(strStart)
This will take us from the beginning of the start string
(X<a href="#"><strong>) to the end of the start string
(<a href="#"><strong>X). At the end of the start string is what we're
looking for (the e-mail address).
lonEnd = InStr(lonPos, HTML, strEnd, vbTextCompare)
The search will start from lonPos and will find strEnd (</strong>).
If lonEnd > 0 Then
If InStr found the ending string (</strong>) then...
strEmail = Mid$(HTML, lonPos, lonEnd - lonPos)
We are using Mid to extract something from the middle of the string.
We start at lonPos, which is the first character of the e-mail address.
We extract lonEnd - lonPos characters, which equals the length of the
e-mail address (whatever that length is).
Done
As you can see, the entire process of that parsing routine was:
Find the start (InStr)
Find the end (InStr)
Extract the data between (Mid)
And you know what? That is the exact same process you will use 90% of
the time when you want to extract data from between two other strings.
Try this: Change up the HTML.
Change the e-mail address. Change the values of strStart and strEnd
in the code to match those of the HTML. Run the code. It will work regardless
(as long as you got the Start and End strings right).
Since most of your parsing routines will use this method, it might be
a good idea to wrap up all the code in a re-usable function.
Wrapping it up
Wrapping up the above code into a reusable function:
Private Function GetBetween(ByVal Start As Long, Data As String, _
        StartString As String, EndString As String, _
        Optional ByVal CompareMethod As VbCompareMethod = vbBinaryCompare) As String
    Dim lonStart As Long, lonEnd As Long
    '1. Find start string.
    lonStart = InStr(Start, Data, StartString, CompareMethod)
    If lonStart > 0 Then
        '2. Move to end of start string.
        lonStart = lonStart + Len(StartString)
        '3. Find end string.
        lonEnd = InStr(lonStart, Data, EndString, CompareMethod)
        If lonEnd > 0 Then
            '4. Extract data between start and end strings.
            GetBetween = Mid$(Data, lonStart, lonEnd - lonStart)
        End If
    End If
End Function
And if we were to use this function for this scenario, it would be:
strEmail = GetBetween(1, HTML, "<a href=""#""><strong>", "</strong>", vbTextCompare)
MsgBox strEmail
1 - Where we start searching in the string (beginning).
HTML - The HTML we are working with.
<a href="#... - The start string.
</strong> - The end string.
vbTextCompare - Case-insensitive search (slower, but more reliable for HTML).
vbBinaryCompare - Case-sensitive search (faster, but more strict).
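The same three steps also work when the start string appears several times and you want every match. Here is a minimal sketch under that assumption. PrintAllBetween is a hypothetical name, and it calls InStr directly rather than GetBetween, because GetBetween doesn't report where it stopped:

```vb
'Hypothetical routine: prints every value found between StartString and
'EndString to the Immediate window.
Private Sub PrintAllBetween(ByRef Data As String, ByRef StartString As String, _
        ByRef EndString As String)
    Dim lonStart As Long, lonEnd As Long

    'Empty search strings would never advance the position.
    If Len(StartString) = 0 Or Len(EndString) = 0 Then Exit Sub

    '1. Find the first start string.
    lonStart = InStr(1, Data, StartString, vbTextCompare)
    Do While lonStart > 0
        '2. Move to the end of the start string.
        lonStart = lonStart + Len(StartString)
        '3. Find the end string. Stop if there is no closing match.
        lonEnd = InStr(lonStart, Data, EndString, vbTextCompare)
        If lonEnd = 0 Then Exit Do
        '4. Extract the data between them.
        Debug.Print Mid$(Data, lonStart, lonEnd - lonStart)
        '5. Look for the next start string after this end string.
        lonStart = InStr(lonEnd + Len(EndString), Data, StartString, vbTextCompare)
    Loop
End Sub
```

Calling PrintAllBetween HTML, "<strong>", "</strong>" on the HTML with two <strong> tags would print Welcome! followed by User@Hotmail.com.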
I hope this short little tutorial helps someone. Feel free to contact
me if you have any questions/comments or suggestions on something to add to
this tutorial, or suggestions for a new tutorial.
Remember the steps: Find the start, find the end, extract between.