Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cookbook Extract Links from HTML #2552

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Conversation

ggsmith842
Copy link
Contributor

Adds an example for the extract-links-from-html task using the Re library. A sample html string is provided and the usage example shows how to read an HTML and then print the links found.

@ggsmith842 ggsmith842 marked this pull request as ready for review June 26, 2024 21:18
Copy link
Collaborator

@cuihtlauac cuihtlauac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ggsmith842. Rendering does seem to work. You probably need to use an OCaml quoted string to store the HTML source.

image

Also, the comments should not translate the code in English, it should tend to be additional information helping to understand the code.

Copy link
Collaborator

@christinerose christinerose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left few questions and a couple of suggestions for your review @ggsmith842. Together, we can clarify those two sections! 🙂

ggsmith842 and others added 3 commits June 27, 2024 10:16
Simplified comments and removed `read_file` section. It is now replaced with an HTML string. I also removed the sample HTML that would render for simplicity.
@ggsmith842
Copy link
Contributor Author

@cuihtlauac and @christinerose thank you for the feedback! I simplified the comments section so hopefully it makes it more concise and appropriate for the recipe.

Copy link
Collaborator

@christinerose christinerose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@yawaramin
Copy link
Contributor

I'd advise against trying to teach people to parse HTML with regular expressions :-) https://stackoverflow.com/a/1732454/20371

Maybe we can recommend Anton Bachin's excellent lambdasoup package? https://ocaml.org/p/lambdasoup/

@ggsmith842
Copy link
Contributor Author

@yawaramin let me look into this and make some changes. Thank you for the suggestion. I wasn't aware of how regex isn't suited for working with html.

@yawaramin
Copy link
Contributor

No worries. Btw it's the second example shown on the docs page: https://aantron.github.io/lambdasoup/

Changed from using regular expression to lambdasoup due to concerns raised about issues working with HTML using regular expressions.

`Re.all` searches the entire `html_content` string for the `pattern`. Passing `1` to `Re.Group.get` returns the
substring versus the entire matching group.
`$$` selects the links in the document.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say 'selects nodes in the document using a selector query'.

open Soup

let find_links html_content =
let document_node = Soup.parse html_content in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we already have open Soup above this, Soup. prefix is redundant here. Actually we can just do:

parse html_content $$ "a[href]" |> iter ...

change `Soup.parse` to `parse` update documentation on selector query `$$`
@yawaramin
Copy link
Contributor

Looks like there's only a small change requested here, anything blocking us from doing this?

used_libraries:
- lambdasoup
discussion: |
- **Refernce:** The lambdasoup package provides a robust toolset for working with HTML. [github.com/lambdasoup](https://github.com/aantron/lambdasoup?tab=readme-ov-file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference

@yawaramin
Copy link
Contributor

I've fixed up a couple of things in my branch: https://github.com/yawaramin/ocaml.org/compare/extract-links-html?expand=1

Unfortunately, GitHub is not recognizing my repo as a fork, so it won't let me create a PR. If anyone is interested, please feel free to pull in my branch which should take care of the remaining issues and clean up the recipe for merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants