Cookbook Extract Links from HTML #2552

ggsmith842 · 2024-06-26T21:16:45Z

Adds an example for the extract-links-from-html task using the Re library. A sample HTML string is provided, and the usage example shows how to read the HTML and print the links found.

cuihtlauac

Thanks @ggsmith842. Rendering does seem to work. You probably need to use an OCaml quoted string to store the HTML source.

Also, the comments should not translate the code into English; they should provide additional information that helps in understanding the code.
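To illustrate the quoted-string suggestion above, here is a minimal sketch using only the standard library. The name `html_content` and the sample markup are assumptions chosen for illustration, not the recipe's actual contents.

```ocaml
(* A quoted string literal: the double quotes inside need no escaping. *)
let html_content = {|<a href="https://ocaml.org">OCaml</a>|}

(* The same value written as an ordinary string literal, for comparison. *)
let escaped = "<a href=\"https://ocaml.org\">OCaml</a>"

let () =
  assert (String.equal html_content escaped);
  print_endline html_content
```

If the HTML itself could contain `|}`, a delimiter identifier such as `{html|...|html}` avoids closing the literal early.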

data/cookbook/extract-links-from-html/00-Re.ml

christinerose

I've left a few questions and a couple of suggestions for your review @ggsmith842. Together, we can clarify those two sections! 🙂

Co-authored-by: Christine Rose

Simplified comments and removed `read_file` section. It is now replaced with an HTML string. I also removed the sample HTML that would render for simplicity.

ggsmith842 · 2024-06-27T16:43:44Z

@cuihtlauac and @christinerose thank you for the feedback! I simplified the comments section so hopefully it makes it more concise and appropriate for the recipe.

christinerose

LGTM!

yawaramin · 2024-07-04T16:20:21Z

I'd advise against trying to teach people to parse HTML with regular expressions :-) https://stackoverflow.com/a/1732454/20371

Maybe we can recommend Anton Bachin's excellent lambdasoup package? https://ocaml.org/p/lambdasoup/

ggsmith842 · 2024-07-04T16:38:04Z

@yawaramin let me look into this and make some changes. Thank you for the suggestion. I wasn't aware that regex isn't well suited to working with HTML.

yawaramin · 2024-07-04T18:16:06Z

No worries. Btw it's the second example shown on the docs page: https://aantron.github.io/lambdasoup/

Changed from using regular expression to lambdasoup due to concerns raised about issues working with HTML using regular expressions.

yawaramin · 2024-07-04T20:41:17Z

data/cookbook/extract-links-from-html/00-lambdasoup.ml


-`Re.all` searches the entire `html_content` string for the `pattern`. Passing `1` to `Re.Group.get` returns the 
-substring versus the entire matching group.
+`$$` selects the links in the document.


I'd say 'selects nodes in the document using a selector query'.
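As a rough sketch of that wording, the snippet below applies `$$` with several CSS selectors, not only ones that match links. It assumes the lambdasoup package is installed; the sample markup and the helper `count_matches` are hypothetical names introduced here for illustration.

```ocaml
(* Requires the lambdasoup package. The markup below is a made-up sample. *)
open Soup

let soup =
  parse {|<ul><li class="done">one</li><li>two</li></ul><a href="/docs">Docs</a>|}

(* `$$` returns every element matching the selector, whatever its tag. *)
let count_matches selector = soup $$ selector |> count

let () =
  List.iter
    (fun selector ->
      Printf.printf "%-10s %d\n" selector (count_matches selector))
    [ "li"; "li.done"; "a[href]"; "table" ]
```

A selector that matches nothing, such as `table` here, simply yields an empty set of nodes rather than an error.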

yawaramin · 2024-07-04T20:45:51Z

data/cookbook/extract-links-from-html/00-lambdasoup.ml

+open Soup
+
+let find_links html_content =
+  let document_node = Soup.parse html_content in 


Since we already have `open Soup` above this, the `Soup.` prefix is redundant here. Actually we can just do:

parse html_content $$ "a[href]" |> iter ...
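The one-liner above could be fleshed out roughly as follows. This is a sketch under assumptions, not the merged recipe: it assumes the lambdasoup package is installed, the sample HTML is invented, and collecting the links into a list (instead of printing inside `iter`) is a choice made here so the result is easy to inspect.

```ocaml
(* Requires the lambdasoup package (for example via `opam install lambdasoup`). *)
open Soup

(* Hypothetical sample document; the second <a> has no href on purpose. *)
let html_content = {|
<html><body>
  <a href="https://ocaml.org">OCaml</a>
  <a name="anchor-without-href">No link</a>
  <a href="/docs">Docs</a>
</body></html>
|}

(* Select every <a> element carrying an href attribute and return the
   attribute values. R.attribute raises if the attribute is missing, which
   the "a[href]" selector rules out. *)
let find_links html =
  parse html $$ "a[href]"
  |> to_list
  |> List.map (R.attribute "href")

let () = find_links html_content |> List.iter print_endline
```

Because `$$` and `|>` share the same precedence level and associate to the left, `parse html $$ "a[href]" |> ...` needs no extra parentheses.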

change `Soup.parse` to `parse` update documentation on selector query `$$`

yawaramin · 2024-07-09T17:01:16Z

Looks like there's only a small change requested here, anything blocking us from doing this?

yawaramin · 2024-07-10T05:46:34Z

data/cookbook/extract-links-from-html/00-lambdasoup.ml

+    used_libraries:
+      - lambdasoup
+discussion: |
+  - **Reference:** The lambdasoup package provides a robust toolset for working with HTML. [github.com/lambdasoup](https://github.com/aantron/lambdasoup?tab=readme-ov-file)


yawaramin · 2024-07-10T22:59:40Z

I've fixed up a couple of things in my branch: https://github.com/yawaramin/ocaml.org/compare/extract-links-html?expand=1

Unfortunately, GitHub is not recognizing my repo as a fork, so it won't let me create a PR. If anyone is interested, please feel free to pull in my branch which should take care of the remaining issues and clean up the recipe for merge.

Cookbook Extract Links from HTML

8ea49e9

ggsmith842 marked this pull request as ready for review June 26, 2024 21:18

ggsmith842 requested a review from christinerose as a code owner June 26, 2024 21:18

cuihtlauac requested changes Jun 27, 2024

cuihtlauac added the Cookbook label Jun 27, 2024

formatting, grammar, verb agreement, etc.

3877f3f

christinerose reviewed Jun 27, 2024

data/cookbook/extract-links-from-html/00-Re.ml (outdated)

christinerose reviewed Jun 27, 2024

data/cookbook/extract-links-from-html/00-Re.ml (outdated)

christinerose reviewed Jun 27, 2024

ggsmith842 and others added 3 commits June 27, 2024 10:16

Update data/cookbook/extract-links-from-html/00-Re.ml

e56dee5

Co-authored-by: Christine Rose

Update data/cookbook/extract-links-from-html/00-Re.ml

4b29462

Co-authored-by: Christine Rose

Update 00-Re.ml

da5c2eb

Simplified comments and removed `read_file` section. It is now replaced with an HTML string. I also removed the sample HTML that would render for simplicity.

ggsmith842 requested review from christinerose and cuihtlauac June 27, 2024 16:44

christinerose approved these changes Jul 1, 2024

Update and rename 00-Re.ml to 00-lambdasoup.ml

e2db468

Changed from using regular expression to lambdasoup due to concerns raised about issues working with HTML using regular expressions.

yawaramin reviewed Jul 4, 2024

Update 00-lambdasoup.ml

d7ad4d8

change `Soup.parse` to `parse` update documentation on selector query `$$`

yawaramin approved these changes Jul 4, 2024

yawaramin reviewed Jul 10, 2024

yawaramin mentioned this pull request Aug 5, 2024

Cookbook Check a Webpage for Broken Links #2581

Open
