It's easy to lose track of something when working on multiple stories and projects in a busy newsroom. DataKit aims for two goals:
- Automate important-yet-repetitive tasks in data projects, letting reporters cover more ground while maintaining standards of reproducibility.
- Be customizable enough to accommodate practices in different newsrooms.
## How does the AP use DataKit?
At the AP, the typical data story is powered by DataKit:

```shell
datakit project create
datakit data init
datakit gitlab integrate
datakit dworld create
datakit dworld push
```
All projects share the same standard folder structure. Every new project is created with dedicated places for interview materials, data files from sources, and code. There is no confusion over which data files came straight from sources and which are intermediate work products, or where a particular analysis script lives.
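The idea of a standard skeleton can be sketched with a few lines of Python. The folder names below are hypothetical, not DataKit's actual template (which newsrooms customize); the point is that every project starts from the same layout.

```python
from pathlib import Path

# Hypothetical skeleton; the real DataKit template defines its own
# folder names, and each newsroom can customize them.
SKELETON = [
    "data/source",     # raw files exactly as received from sources
    "data/processed",  # intermediate work products
    "analysis",        # analysis scripts and notebooks
    "docs",            # interview materials and notes
]

def create_project(root: str) -> Path:
    """Create a new project directory with the standard layout."""
    project = Path(root)
    for folder in SKELETON:
        (project / folder).mkdir(parents=True, exist_ok=True)
    return project
```

Because every project is stamped from the same template, a collaborator opening any project already knows where everything goes.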
Data files are synced to Amazon S3 cloud storage. An unwieldy command-line ritual has been boiled down to a single interface for pushing and pulling: `datakit data push`. The specifics of the S3 sync are stored in the project itself, allowing collaborators to pull copies of the project's data with the single command `datakit data pull`.
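Storing the sync specifics in the project is what makes a one-word push or pull possible. Here is a minimal sketch of that pattern, assuming a hypothetical project-local JSON config and the standard `aws s3 sync` command; DataKit's real config format and internals are not shown in this post.

```python
import json
from pathlib import Path

# Hypothetical config file name and keys; DataKit keeps its own
# settings inside the project so all collaborators share them.
def load_sync_config(project_dir: str) -> dict:
    """Read the project's S3 sync settings from a JSON file."""
    return json.loads((Path(project_dir) / ".datakit-data.json").read_text())

def build_sync_command(config: dict, direction: str) -> str:
    """Translate 'push'/'pull' into an underlying aws s3 sync call."""
    local = "data/"
    remote = f"s3://{config['bucket']}/{config['prefix']}/"
    src, dest = (local, remote) if direction == "push" else (remote, local)
    return f"aws s3 sync {src} {dest}"
```

With the bucket and prefix fixed in the project, "push" and "pull" are the only decisions a reporter has to make.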
Code is all tracked in version control and hosted on GitLab. Important analysis code no longer lives on only one computer, and a detailed revision history is now an assumed feature of every project. The integration automatically creates a project in GitLab, ready for the reporter to push commits. Issues can be filed quickly using `gitlab issues add`, without having to open the web interface.
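Under the hood, filing an issue from the command line comes down to a call to GitLab's REST API (`POST /projects/:id/issues`). The sketch below builds, but does not send, such a request using only the standard library; the host, project ID, and token are placeholders, and this is not DataKit's actual implementation.

```python
import json
import urllib.request

def build_issue_request(host: str, project_id: int,
                        token: str, title: str) -> urllib.request.Request:
    """Construct a GitLab issue-creation request (not sent here).

    All arguments are placeholders for illustration.
    """
    url = f"https://{host}/api/v4/projects/{project_id}/issues"
    payload = json.dumps({"title": title}).encode()
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )
```

Wrapping this round trip in one short command is what lets a reporter file a to-do without breaking stride.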
Data files for distribution are managed through DataKit. This streamlines the process of releasing data files to members while also enforcing specific rules on which files are made public.
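Enforcing which files go public is, at its core, an allowlist check. The patterns below are hypothetical stand-ins, not AP's actual rules; the sketch just shows how a release step can mechanically filter a project's files before anything is distributed.

```python
from fnmatch import fnmatch

# Hypothetical allowlist: only cleaned, documented outputs go public.
PUBLIC_PATTERNS = ["data/public/*.csv", "data/public/*.json", "README*"]

def publishable(paths: list[str]) -> list[str]:
    """Return only the files that match the public allowlist."""
    return [p for p in paths
            if any(fnmatch(p, pat) for pat in PUBLIC_PATTERNS)]
```

Raw source files and internal analysis scripts simply never match, so they cannot be released by accident.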