issue regarding .to_markdown() method #380

Closed
red-sky17 opened this issue Jun 24, 2024 · 4 comments

Comments

@red-sky17

Since the 1.8.0 release, we can no longer use the .to_markdown() method.

Our workflow, used mainly for PDFs, is as follows:

  • create the JSON by calling the call_textract function with the layout and table features enabled
  • use the JSON to create a "textractor.entities.document.Document"
  • then experiment with three methods: get_text_from_layout_json(), "textractor.entities.document.Document".get_text() with configs, or document.to_markdown()
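Since the last step already switches between Document.get_text() and document.to_markdown(), one possible stopgap is a small wrapper that falls back to plain text when Markdown rendering raises. This is only a sketch: the helper name `to_markdown_or_text` is hypothetical, and it assumes that get_text() with its default config does not hit the same failure, which this thread does not confirm.

```python
def to_markdown_or_text(document):
    """Return Markdown for `document`, falling back to plain text.

    `document` is any object exposing to_markdown() and get_text(),
    such as a textractor Document. The helper name is hypothetical.
    """
    try:
        return document.to_markdown()
    except UnboundLocalError:
        # Assumption: the plain-text path still works when the
        # Markdown path raises. Verify this on your own documents.
        return document.get_text()
```

The wrapper only catches UnboundLocalError, so any unrelated failure still surfaces instead of being silently replaced by plain text.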

Calling the .to_markdown() method raises:

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[36], line 1
----> 1 print(document.to_markdown())
      2 with open('using_markdown_testimage.txt', 'w', encoding = "utf-8") as outfile:
      3     outfile.write(document.to_markdown())

File /opt/conda/lib/python3.10/site-packages/textractor/entities/linearizable.py:59, in Linearizable.to_markdown(self, config)
     49 def to_markdown(
     50     self,
     51     config: MarkdownLinearizationConfig = MarkdownLinearizationConfig()
     52 ) -> str:
     53     """
     54     Returns the markdown representation of the entity
     55 
     56     :return: Markdown text of the entity
     57     :rtype: str
     58     """
---> 59     return self.get_text(config)

File /opt/conda/lib/python3.10/site-packages/textractor/entities/linearizable.py:24, in Linearizable.get_text(self, config)
     13 def get_text(
     14     self, config: TextLinearizationConfig = TextLinearizationConfig()
     15 ) -> str:
     16     """
     17     Returns the linearized text of the entity
     18 
   (...)
     22     :rtype: str
     23     """
---> 24     text, _ = self.get_text_and_words(config=config)
     25     return text

File /opt/conda/lib/python3.10/site-packages/textractor/entities/document.py:261, in Document.get_text_and_words(self, config)
    258 def get_text_and_words(
    259     self, config: TextLinearizationConfig = TextLinearizationConfig()
    260 ) -> Tuple[str, List]:
--> 261     text, words_lists = zip(*[p.get_text_and_words(config) for p in self.pages])
    262     flattened_words = []
    263     for words in words_lists:

File /opt/conda/lib/python3.10/site-packages/textractor/entities/document.py:261, in <listcomp>(.0)
    258 def get_text_and_words(
    259     self, config: TextLinearizationConfig = TextLinearizationConfig()
    260 ) -> Tuple[str, List]:
--> 261     text, words_lists = zip(*[p.get_text_and_words(config) for p in self.pages])
    262     flattened_words = []
    263     for words in words_lists:

File /opt/conda/lib/python3.10/site-packages/textractor/entities/page.py:169, in Page.get_text_and_words(self, config)
    166         else:
    167             sorted_layouts.append(unsorted_layout)
--> 169 page_texts_and_words = [l.get_text_and_words(config) for l in sorted_layouts]
    171 if not page_texts_and_words:
    172     return "", []

File /opt/conda/lib/python3.10/site-packages/textractor/entities/page.py:169, in <listcomp>(.0)
    166         else:
    167             sorted_layouts.append(unsorted_layout)
--> 169 page_texts_and_words = [l.get_text_and_words(config) for l in sorted_layouts]
    171 if not page_texts_and_words:
    172     return "", []

File /opt/conda/lib/python3.10/site-packages/textractor/entities/layout.py:222, in Layout.get_text_and_words(self, config)
    218     final_text = (
    219         config.text_prefix + final_text + config.text_suffix
    220     )
    221 else:
--> 222     final_text, final_words = linearize_children(
    223         self.children,
    224         config,
    225         no_new_lines=False,
    226         is_layout_table=self.layout_type == LAYOUT_TABLE,
    227     )
    229     if config.add_prefixes_and_suffixes_in_text:
    230         if self.layout_type == LAYOUT_TABLE:

File /opt/conda/lib/python3.10/site-packages/textractor/utils/text_utils.py:150, in linearize_children(elements, config, no_new_lines, is_layout_table)
    148 added_words = set()
    149 for idx, element in enumerate(sorted_group):
--> 150     text_element, words_element = element.get_text_and_words(config)
    151     if "Table" in element.__class__.__name__ and len(words_element):
    152         result += text_element

File /opt/conda/lib/python3.10/site-packages/textractor/entities/table.py:1033, in Table.get_text_and_words(self, config)
   1030         text += (local_config.table_row_suffix if local_config.add_prefixes_and_suffixes_in_text else "")
   1031         text += local_config.table_row_separator
-> 1033 if local_config.table_add_title_as_caption and self.title:
   1034     text += "<caption>" + self.title.get_text() + "</caption>"
   1036 text += (local_config.table_suffix if local_config.add_prefixes_and_suffixes_in_text else "")

UnboundLocalError: local variable 'local_config' referenced before assignment
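For readers unfamiliar with this error class, here is a toy illustration of how an UnboundLocalError like the one above can arise. It is not textractor code: the function `render_rows` and its contents are made up, and the idea that `local_config` is bound only inside a loop or branch in table.py is a guess based on the indentation visible in the last traceback frame.

```python
def render_rows(rows, add_caption=True):
    """Toy renderer, not textractor code."""
    text = ""
    for row in rows:
        # local_config is bound only inside the loop body.
        local_config = {"separator": "\n", "caption": add_caption}
        text += row + local_config["separator"]
    # If rows is empty, the loop body never ran, so local_config
    # has no value here and Python raises UnboundLocalError.
    if local_config["caption"]:
        text += "<caption>demo</caption>"
    return text


print(render_rows(["a", "b"]))  # works: the loop ran at least once

try:
    render_rows([])
except UnboundLocalError as exc:
    print("failed:", exc)
```

The usual remedy for this pattern is to bind the variable before the loop or branch so that every code path sees a value.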

Let me know if this is an issue on our side.

@red-sky17
Author

#356

@Belval
Contributor

Belval commented Jun 24, 2024

This is an issue with 1.8.0. I'll release version 1.8.1, which addresses it, as soon as the tests are passing.
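To check whether an environment is on the affected release, something like the following may help. It is a sketch under assumptions: the distribution name "amazon-textract-textractor" is assumed to be the name the package is installed under, and the helper names are hypothetical. Only 1.8.0 is flagged, since that is the only version this thread identifies.

```python
from importlib.metadata import PackageNotFoundError, version

AFFECTED_VERSION = "1.8.0"  # the release named in this thread


def installed_textractor_version(dist_name="amazon-textract-textractor"):
    """Return the installed version string, or None if not installed.

    dist_name is an assumption about the distribution name.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def is_affected(installed):
    """True only for the exact release reported as broken here."""
    return installed is not None and installed.strip() == AFFECTED_VERSION


current = installed_textractor_version()
if is_affected(current):
    print("textractor 1.8.0 detected; consider upgrading to 1.8.1")
```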

@Belval
Contributor

Belval commented Jul 3, 2024

Is this still an issue?

@red-sky17
Author

No, it's resolved. We can close this.

@Belval closed this as completed on Aug 16, 2024