EDGAR is the SEC documents registration system. This post is part of a series on the EDGAR data, how to acquire it, analyze it, and use it for greater powers. See the series index
In this post
- Explore EDGAR data structure and associated documents/attributes
- Develop strategy to sample data to home computer managable
- Acquire the data
What We’ve Got
- Starting point – a list of companies and their EDGAR submissions.
- Parsed accounting data for publicly traded companies
What’s Next (Issues)
- Current processing limited (django-sec)
- Not all form types are being processed
- Not all fields from 10-K/Q processed
- No company demographic information
- Data not structured for analysis
- Entire dataset is still pretty big
Getting More Data
We’ve got a 5GB database of company names and an index of associated documents from 2007-2017. I want demographic data from the company profile page, acquire and structure the 13F-HR filings (institutional investor holdings).
Scrape EDGAR
We’ve got a nice website to scrape, and can dump a list of cik values from SQLite to parameterize the URLs to extract the data we want. I use scrapy
because it comes with a caching engine, making it super fast to rerun in the future.
Next step: use scrapy to scrape company demographics into local csv files.