.. _dataloader:

DataLoader Class
----------------------

.. autoclass:: edgar.DataLoader
   :members:

   .. automethod:: __init__

Usage
=====

The DataLoader finds, loads, and returns parsed text from a set of requested documents. The constructor downloads the necessary files, and the text is extracted just-in-time when it is requested. The interface supports indexing, slicing, and iteration.

Construct a loader by passing the list of tickers (``tikrs``) followed by the form type:

.. code-block:: python
   :linenos:

   from edgar import DataLoader

   data = DataLoader(['nflx'], '8-K')

Index a single document:

.. code-block:: python
   :linenos:
   :lineno-start: 4

   print(data[14])

Iterate over all documents:

.. code-block:: python
   :linenos:
   :lineno-start: 5

   for text in data:
       print(text)

Take a slice:

.. code-block:: python
   :linenos:
   :lineno-start: 7

   print(data[:3])

Custom Parse Functions
========================

To access the raw data, pass a 'no operation' cleaning function as the ``parser`` argument.

.. code-block:: python

   def noop(x):
       return x

   data = DataLoader(['nflx'], '8-K', parser=noop)

The default cleaning function is as follows:

.. code-block:: python

   import re

   from edgar import html

   def clean(text):
       # Remove html <> tags
       text = html.remove_tags(text)
       # Remove most malformed characters
       text = html.remove_htmlbytes(text)
       # Remove newlines / tabs
       text = re.sub('[\n\t]', ' ', text)
       # Replace multiple spaces in a row with one
       text = html.compress_spaces(text)
       return text

For custom purposes, a modified cleaning function can be passed to the DataLoader. For example, we may want to remove numerical tables from 10-Q forms.

.. code-block:: python

   import re

   from edgar import html

   def clean_tables(text):
       # Remove numerical tables
       text = html.remove_tables(text)
       # Remove html <> tags
       text = html.remove_tags(text)
       # Remove most malformed characters
       text = html.remove_htmlbytes(text)
       # Remove newlines / tabs
       text = re.sub('[\n\t]', ' ', text)
       # Replace multiple spaces in a row with one
       text = html.compress_spaces(text)
       return text
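
The cleaning functions above are ordinary callables that take the raw text and return a string, so a custom parser can also be assembled from small, reusable steps. The following is a minimal sketch of that idea using only the standard library. The names ``compose``, ``strip_tags``, and ``collapse_whitespace`` are hypothetical helpers introduced for illustration; they are not part of ``edgar`` and are only rough stand-ins for its ``html`` helpers.

.. code-block:: python

   import re
   from functools import reduce


   def compose(*steps):
       """Chain cleaning steps, applied left to right, into one callable."""
       def parser(text):
           return reduce(lambda acc, step: step(acc), steps, text)
       return parser


   def strip_tags(text):
       # Hypothetical stand-in for a tag remover: replace <...> with a space
       return re.sub(r'<[^>]+>', ' ', text)


   def collapse_whitespace(text):
       # Replace any run of whitespace (newlines, tabs, spaces) with one space
       return re.sub(r'\s+', ' ', text).strip()


   # Build a parser from the steps; str.lower is added as an extra final step
   custom_parser = compose(strip_tags, collapse_whitespace, str.lower)

   sample = '<p>Item 8.01\n\tOther   Events</p>'
   cleaned = custom_parser(sample)
   print(cleaned)

Assuming the ``parser`` argument accepts any such callable, as the ``noop`` example suggests, the result could be passed as ``DataLoader(['nflx'], '8-K', parser=custom_parser)``.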