-
Notifications
You must be signed in to change notification settings - Fork 326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cookbook Extract Links from HTML #2552
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ggsmith842. Rendering does seem to work. You probably need to use an OCaml quoted string to store the HTML source.
Also, the comments should not translate the code in English, it should tend to be additional information helping to understand the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left few questions and a couple of suggestions for your review @ggsmith842. Together, we can clarify those two sections! 🙂
Co-authored-by: Christine Rose <[email protected]>
Co-authored-by: Christine Rose <[email protected]>
Simplified comments and removed `read_file` section. It is now replaced with an HTML string. I also removed the sample HTML that would render for simplicity.
@cuihtlauac and @christinerose thank you for the feedback! I simplified the comments section so hopefully it makes it more concise and appropriate for the recipe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I'd advise against trying to teach people to parse HTML with regular expressions :-) https://stackoverflow.com/a/1732454/20371 Maybe we can recommend Anton Bachin's excellent lambdasoup package? https://ocaml.org/p/lambdasoup/ |
@yawaramin let me look into this and make some changes. Thank you for the suggestion. I wasn't aware of how regex isn't suited for working with html. |
No worries. Btw it's the second example shown on the docs page: https://aantron.github.io/lambdasoup/ |
Changed from using regular expression to lambdasoup due to concerns raised about issues working with HTML using regular expressions.
|
||
`Re.all` searches the entire `html_content` string for the `pattern`. Passing `1` to `Re.Group.get` returns the | ||
substring versus the entire matching group. | ||
`$$` selects the links in the document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say 'selects nodes in the document using a selector query'.
open Soup | ||
|
||
let find_links html_content = | ||
let document_node = Soup.parse html_content in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we already have open Soup
above this, Soup.
prefix is redundant here. Actually we can just do:
parse html_content $$ "a[href]" |> iter ...
change `Soup.parse` to `parse` update documentation on selector query `$$`
Looks like there's only a small change requested here, anything blocking us from doing this? |
used_libraries: | ||
- lambdasoup | ||
discussion: | | ||
- **Refernce:** The lambdasoup package provides a robust toolset for working with HTML. [github.com/lambdasoup](https://github.com/aantron/lambdasoup?tab=readme-ov-file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reference
I've fixed up a couple of things in my branch: https://github.com/yawaramin/ocaml.org/compare/extract-links-html?expand=1 Unfortunately, GitHub is not recognizing my repo as a fork, so it won't let me create a PR. If anyone is interested, please feel free to pull in my branch which should take care of the remaining issues and clean up the recipe for merge. |
Adds an example for the extract-links-from-html task using the
Re
library. A sample html string is provided and the usage example shows how to read an HTML and then print the links found.