A dynamic system to email K-1 PDFs to investors.
Every tax year we need to send hundreds of K-1 tax forms to investors. We receive these forms from our external accountants. This project automates the process of matching K-1 PDFs to an internal investor contact table and then emailing the attachments with an interpolated email body. The emails are sent via the Outlook API.
- `k1_processor.py`: Contains the `K1BatchProcessor` class that handles all the work
- `main.py`: Entry point
- `config.py`: Set running parameters here for each code run (not tracked because it changes constantly, but can be recreated from `config.pytemplate`)
- `auth.py`: Microsoft API authentication (more on this below)
- `logger.py`: Logging configuration
Inside the entry point, you can choose which of the external methods get run. The methods are called from outside the class to allow for step-by-step processing, instead of being forced to run everything in one shot. This is specifically built in as a safeguard because emailing investors and handling tax data is extremely sensitive.
- Manually copy K-1 PDFs into the `files` directory, into their respective investment folders.
- Ensure `investors.xlsx` contains correct investor information.
- Set running parameters in `config.py`, which get imported into the entry point (create `config.py` from `config.pytemplate` if it does not exist). See the docstring of the `__init__()` method of `K1BatchProcessor` for explanations of how to set the config parameters.
- Instantiate the `K1BatchProcessor` class in the entry point. This ensures the correct folder structure (explained below) and gathers the K-1s from the folders to prepare for processing. "Managers" K-1s are excluded as they are not emailed to investors.
- The `extract_entities()` method reads the PDFs and attempts to extract the issuing entity and receiving entity from each. These are stored in a `pickle` cache to speed up future runs on the same files (for staggered emailing, testing, or any other required re-run). The cache is loaded if it exists; otherwise extraction is run on all gathered files.
- The `match_files_and_keys()` method attempts to match the extracted entity information from each file to an investor contact in `investors.xlsx` to prepare for emailing.
- The `send_emails()` method sends emails with K-1 attachments to the matched investors. You will be prompted (`y/n`) to confirm that you want to send emails (another safeguard).
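The confirmation safeguard in the last step could look roughly like the sketch below. This is not the project's actual `send_emails()` implementation: the function names, the `ask` parameter, and the row format are assumptions made for illustration, and no email is sent.

```python
from typing import Callable


def confirm_send(recipient_count: int, ask: Callable[[str], str] = input) -> bool:
    """Return True only if the operator explicitly answers 'y' or 'yes'."""
    answer = ask(f"Send K-1 emails to {recipient_count} investors? (y/n) ")
    return answer.strip().lower() in {"y", "yes"}


def send_emails(matched_rows: list[dict], ask: Callable[[str], str] = input) -> list[dict]:
    """Hypothetical stand-in for the real method: no email is actually sent here."""
    if not confirm_send(len(matched_rows), ask):
        print("Aborted: no emails were sent.")
        return []
    # The real project would call the Outlook API at this point.
    return [{**row, "sent": True} for row in matched_rows]
```

Anything other than an explicit `y` or `yes` (including an empty answer) aborts, so an accidental Enter key press never triggers a send. Passing `ask` as a parameter makes the prompt testable without a terminal.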
These directories and their contents are not tracked; however, `logs`, `snapshots`, and `investors.xlsx` are synced to S3.
- `cache`: Contains a single `pickle` cache file of the extracted entities from each K-1 file
- `dumps`: Stores text files of the extracted text from each K-1 page
- `files`: Contains folders for each investment, holding the K-1 PDFs
- `logs`: Stores text logs of standard output (print statements, etc.) from code runs, and CSV logs of unmatched files
- `snapshots`: Stores snapshots of `investors.xlsx` as backups
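The load-if-present behaviour of the `cache` directory can be sketched as follows. This is a minimal illustration, not the code in `k1_processor.py`: the function name, the cache file name, and the `extract` callable (a stand-in for the real PDF parsing) are all assumptions.

```python
import pickle
from pathlib import Path
from typing import Callable


def load_or_extract(
    pdf_paths: list[Path],
    cache_path: Path,
    extract: Callable[[Path], dict],
) -> list[dict]:
    """Load cached results if the cache file exists, otherwise extract and cache."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    # No cache yet: run the (slow) extraction on every gathered file.
    results = [extract(path) for path in pdf_paths]
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(results, fh)
    return results
```

With this pattern the cache is never invalidated automatically, so if the set of PDFs changes, the cache file has to be deleted to force a fresh extraction.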
- Every time the class is instantiated, a timestamped snapshot of `investors.xlsx` is stored inside the `snapshots` directory for safety.
- A timestamped text file is stored inside the `logs` directory, containing the standard output from every code run. The `print_k1_array()` method can be called to also append the `k1_array` (i.e., the result of the `extract_entities()` method) to this log file. The `k1_array` is not printed to the terminal, to avoid crowding it.
- A timestamped CSV file is stored inside the `logs` directory whenever `extract_entities()` is called, containing a table of the K-1 files that did not match any investor contact in `investors.xlsx`.
- A timestamped CSV file is stored inside the `logs` directory whenever `send_emails()` is run, containing all attempted investor rows along with the sent status and timestamp.
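As an illustration of the timestamped files described above, the sketch below builds a sortable file name and copies a file into a snapshot directory. The naming pattern and both function names are assumptions; the project may format its timestamps differently.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_name(stem: str, suffix: str, now: Optional[datetime] = None) -> str:
    """Build a sortable file name such as 'investors_20240315_142501.xlsx'."""
    now = now or datetime.now()
    return f"{stem}_{now:%Y%m%d_%H%M%S}{suffix}"


def snapshot_file(source: Path, snapshots_dir: Path, now: Optional[datetime] = None) -> Path:
    """Copy source into snapshots_dir under a timestamped name and return the copy's path."""
    snapshots_dir.mkdir(parents=True, exist_ok=True)
    target = snapshots_dir / timestamped_name(source.stem, source.suffix, now)
    shutil.copy2(source, target)  # copy2 also preserves file metadata
    return target
```

Putting the year first in the timestamp means an alphabetical directory listing is also chronological.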
The `logs` directory, `snapshots` directory, and `investors.xlsx` file are synced to S3 whenever changes are made to them. These changes are tracked during code runs using instance variables as flags (e.g., `self.logs_changed`).
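The flag-based change tracking might be sketched like this. Only `logs_changed` is named in this README; the other flag names, the `SyncTracker` class, and the injected `upload` callable (standing in for the real S3 upload) are hypothetical.

```python
from pathlib import Path
from typing import Callable


class SyncTracker:
    """Tracks which synced paths changed during a run (flag names are illustrative)."""

    def __init__(self, upload: Callable[[Path], None]) -> None:
        self.upload = upload  # stand-in for the real S3 upload routine
        self.logs_changed = False
        self.snapshots_changed = False
        self.investors_changed = False

    def sync_changed(self) -> list[Path]:
        """Upload only the paths whose flag is set, then reset the flags."""
        targets = [
            ("logs_changed", Path("logs")),
            ("snapshots_changed", Path("snapshots")),
            ("investors_changed", Path("investors.xlsx")),
        ]
        uploaded = []
        for flag, path in targets:
            if getattr(self, flag):
                self.upload(path)
                setattr(self, flag, False)
                uploaded.append(path)
        return uploaded
```

Resetting each flag after its upload keeps a second call within the same run from re-uploading unchanged paths.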
As mentioned, emails are sent via the Outlook API, and authentication is handled with the `msal` package inside `auth.py`. The credentials that are fed to `msal` are stored in AWS Parameter Store (AWS is our cloud platform; we use Azure only because the Outlook API requires it). Thus, your environment needs to be configured with AWS credentials in order for the Outlook API to be authenticated.
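A rough sketch of that flow is shown below. It is not the contents of `auth.py`: the parameter names and prefix are hypothetical, and the sketch assumes an app-only (client credentials) flow against Microsoft Graph, which this README does not confirm. It relies on `boto3` and `msal`, both imported lazily.

```python
from typing import Any


def fetch_parameter(name: str, ssm_client: Any = None) -> str:
    """Read one decrypted value from AWS Systems Manager Parameter Store."""
    if ssm_client is None:
        import boto3  # imported lazily; needs AWS credentials in the environment

        ssm_client = boto3.client("ssm")
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


def acquire_graph_token(prefix: str = "/k1-mailer", ssm_client: Any = None) -> str:
    """Get an access token via the client-credentials flow (an assumed flow)."""
    import msal  # imported lazily so fetch_parameter works without msal installed

    # Parameter names below are hypothetical, not the project's real ones.
    client_id = fetch_parameter(f"{prefix}/client_id", ssm_client)
    client_secret = fetch_parameter(f"{prefix}/client_secret", ssm_client)
    tenant_id = fetch_parameter(f"{prefix}/tenant_id", ssm_client)

    app = msal.ConfidentialClientApplication(
        client_id,
        authority=f"https://login.microsoftonline.com/{tenant_id}",
        client_credential=client_secret,
    )
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" not in result:
        raise RuntimeError(result.get("error_description", "token request failed"))
    return result["access_token"]
```

If AWS credentials are missing, the failure happens in `fetch_parameter` before any request reaches Microsoft, which is why a misconfigured AWS environment shows up as an authentication problem for the email step.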