DataLoader Class
- class edgar.DataLoader(tikrs: str, document_type: str, data_dir: str = 'data', parser: typing.Callable[[str], str] = clean_text, config: typing.Optional[edgar.DataLoaderConfig] = None, loading_bar: bool = True)
Master class for iterating through parsed text of filing submissions.
- __init__(tikrs: str, document_type: str, data_dir: str = 'data', parser: typing.Callable[[str], str] = clean_text, config: typing.Optional[edgar.DataLoaderConfig] = None, loading_bar: bool = True)
Construct the desired data-loading pipeline.
- Parameters
tikrs (List[str]) – A set of companies to load documents for.
document_type (DocumentType or str) – The submission type to load documents for.
parser (Callable[[str], str]) – A function that turns a raw document into cleaned output.
loading_bar (bool = True) – If True, display a tqdm loading bar while downloading files.
Notes
If files for the requested company tickers are not available locally, the constructor downloads and caches them during initialization for future use.
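The download-and-cache behaviour in the note can be pictured with a minimal sketch. It is not the library's implementation: `ensure_cached`, the `fetch` callback, and the one-file-per-ticker layout under `data_dir` are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, List


def ensure_cached(tikrs: List[str], data_dir: str,
                  fetch: Callable[[str], str]) -> List[Path]:
    """Return one local file per ticker, fetching only those not yet on disk.

    Hypothetical helper: `fetch` stands in for whatever download step the
    real library performs, and the `<data_dir>/<tikr>.txt` layout is assumed.
    """
    root = Path(data_dir)
    root.mkdir(parents=True, exist_ok=True)
    paths = []
    for tikr in tikrs:
        path = root / f"{tikr}.txt"
        if not path.exists():
            # Cache miss: download once and keep the result for later runs.
            path.write_text(fetch(tikr), encoding="utf-8")
        paths.append(path)
    return paths
```

With this shape, a second call for the same ticker reads from disk and never invokes `fetch`, which is the effect the note describes for repeated initializations.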
Usage
The DataLoader finds, loads, and returns parsed text from a set of desired documents. The constructor downloads the necessary files; the text is extracted just in time, when the user requests it.
The interface supports integer indexing, slice indexing, and iteration.
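from edgar import DataLoader

# Keyword arguments must not precede positional ones, so both are named here
data = DataLoader(tikrs=['nflx'], document_type='8-K')
# Parsed text of a single document
print(data[14])
# Iterate over every document
for text in data:
    print(text)
# Slice-index the first three documents
print(data[:3])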
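To make the just-in-time behaviour concrete, here is a minimal sketch of a container that stores raw documents and applies the parser only when an item is requested. `LazyTexts` is a hypothetical stand-in, not the `DataLoader` source. In particular, what the real class returns for a slice is not stated here, so returning a list is an assumption.

```python
from typing import Callable, Iterator, List, Union


class LazyTexts:
    """Hypothetical stand-in for a loader that parses documents on access."""

    def __init__(self, raw_docs: List[str], parser: Callable[[str], str]):
        self._raw = raw_docs      # unparsed documents, kept as-is
        self._parser = parser     # applied only when text is requested

    def __len__(self) -> int:
        return len(self._raw)

    def __getitem__(self, idx: Union[int, slice]) -> Union[str, List[str]]:
        if isinstance(idx, slice):
            # A slice yields a list of parsed documents (assumed behaviour).
            return [self._parser(doc) for doc in self._raw[idx]]
        # An integer yields a single parsed document.
        return self._parser(self._raw[idx])

    def __iter__(self) -> Iterator[str]:
        # Parse one document at a time so nothing is cleaned up front.
        for doc in self._raw:
            yield self._parser(doc)


texts = LazyTexts(["  first ", "second  ", " third"], parser=str.strip)
first_two = texts[:2]     # parses only the two requested documents
everything = list(texts)  # parses each document as the iterator advances
```

Deferring the parse this way keeps construction cheap and means an expensive cleaning function only runs on the documents that are actually read.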
Custom Parse Functions
To access the raw data, pass in a ‘no operation’ cleaning function.
from edgar import DataLoader

def noop(x):
    return x

data = DataLoader(['nflx'], '8-K', parser=noop)
The default cleaning function is as follows:
import re

from edgar import html

def clean(text):
    # Remove HTML <> tags
    text = html.remove_tags(text)
    # Remove most malformed characters
    text = html.remove_htmlbytes(text)
    # Remove newlines / tabs
    text = re.sub('[\n\t]', ' ', text)
    # Replace multiple spaces in a row with one
    text = html.compress_spaces(text)
    return text
For custom purposes, a modified cleaning function can be passed to the dataloader.
For example, we may want to remove numerical tables from 10-Q forms.
import re

from edgar import html

def clean_tables(text):
    # Remove numerical tables
    text = html.remove_tables(text)
    # Remove HTML <> tags
    text = html.remove_tags(text)
    # Remove most malformed characters
    text = html.remove_htmlbytes(text)
    # Remove newlines / tabs
    text = re.sub('[\n\t]', ' ', text)
    # Replace multiple spaces in a row with one
    text = html.compress_spaces(text)
    return text
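Because a parser is just a `Callable[[str], str]`, variants like the two cleaning functions above can also be assembled from smaller steps rather than copied and edited. The sketch below shows one way to do that. `compose`, `strip_tags`, and `flatten_whitespace` are hypothetical names, and the regex-based steps are standard-library stand-ins, not the `edgar.html` functions.

```python
import re
from typing import Callable

Parser = Callable[[str], str]


def compose(*steps: Parser) -> Parser:
    """Chain single-purpose cleaning steps into one parser, applied left to right."""
    def parser(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return parser


# Standard-library stand-ins for illustration only; NOT edgar.html functions.
def strip_tags(text: str) -> str:
    # Replace anything that looks like <...> with a space.
    return re.sub(r"<[^>]+>", " ", text)


def flatten_whitespace(text: str) -> str:
    # Collapse runs of whitespace (including newlines and tabs) into one space.
    return re.sub(r"\s+", " ", text).strip()


clean_basic = compose(strip_tags, flatten_whitespace)
```

The composed function takes a string and returns a string, so it fits the `parser` argument in the same way as `noop` or `clean_tables`. Adding a table-removal step then means placing one more function at the front of the `compose` call.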