# sproc
sproc is Python software to download and parse metadata from the Spanish government's Plataforma de Contratación del Sector Público (the public-sector procurement platform). It produces parquet files that can be easily read in many programming languages.
This project was developed with nbdev, and hence each module stems from a Jupyter notebook that contains the code, along with tests and documentation. If you are interested in the inner workings of any module, you can check its corresponding notebook in the appropriate section of the project's GitHub Pages site.
## Install
```
pip install git+https://github.com/nextprocurement/sproc@main
```

should do.
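If the install succeeded, importing the package from Python should work. A minimal check (note that the import name `sproc`, matching the repository name, is an assumption here):

```python
# Quick post-install sanity check.
# Assumption: the package is importable as "sproc" (its repository name).
import sproc

print(sproc.__name__)
```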
## How to use
The software can be used as a library or through standalone scripts.
### Scripts
#### Downloading data
The `sproc_dl` command is the workhorse of the library. It downloads all the data of a given kind into a parquet file, which can later be updated by invoking the same command. Running, e.g.,

```
sproc_dl outsiders
```

will download all the aggregated procurement data (excluding minor contracts) and write an `outsiders.parquet` file. The argument `-o` can be used to specify a directory other than the current one. Instead of `outsiders`, one can pass `insiders` or `minors`.
This is the highest-level command, and most likely the only one you need. The remaining ones (briefly explained below) provide access to finer-grained functionality.
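The resulting file can then be read like any other parquet file. For instance, in Python (a sketch assuming a previous `sproc_dl outsiders` run left `outsiders.parquet` in the current directory):

```python
import pandas as pd

# Load the file written by "sproc_dl outsiders".
df = pd.read_parquet('outsiders.parquet')

print(df.shape)        # number of entries and columns
print(df.columns[:5])  # a peek at the (multiindexed) column names
```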
#### Processing a single zip file
For testing purposes one can download the Outsiders contracts for 2018, either directly from the URL below or, if `wget` is available, by running

```
wget https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_2018.zip
```
Running

```
sproc_read_single_zip.py PlataformasAgregadasSinMenores_2018.zip 2018.parquet
```

outputs the file `2018.parquet` (the name being given by the second argument), which contains a `pd.DataFrame` with all the 2018 metadata. It can be readily loaded (in Python, through pandas' `pd.read_parquet`). The columns of the stored `pd.DataFrame` are multiindexed, meaning one gets columns such as `('ContractFolderStatus', 'ContractFolderID')` and `('ContractFolderStatus', 'ContractFolderStatusCode')`. This is very convenient when visualizing the data (see the documentation for the `hier` module).
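For illustration, this is how those multiindexed columns can be handled with plain pandas (a sketch, assuming `2018.parquet` was generated as above):

```python
import pandas as pd

df = pd.read_parquet('2018.parquet')

# A single column is addressed by a (level 0, level 1) tuple...
ids = df[('ContractFolderStatus', 'ContractFolderID')]

# ...whereas the first level alone selects the whole group of related columns.
status_columns = df['ContractFolderStatus']
```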
#### From hierarchical (multiindexed) columns to plain ones
The columns of the above `pd.DataFrame` can be flattened to get, in the example above, `ContractFolderStatus - ContractFolderID` and `ContractFolderStatus - ContractFolderStatusCode`, respectively. Additionally, some renaming might be applied following the mapping in a YAML file:

```
sproc_rename_cols.py 2018.parquet -l samples/PLACE.yaml
```

This would yield a `pd.DataFrame` with plain columns in the file `2018_renamed.parquet`. Renaming is carried out using the mapping in `PLACE.yaml`, which can be found in the `samples` directory of this repository. If you don't provide a local file (`-l`) or a remote file (`-r`), a default naming scheme will be used, provided the name of the input file is `outsiders.parquet`, `insiders.parquet`, or `minors.parquet`.
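The flattening itself boils down to joining the levels of every column name. A minimal pandas sketch of the idea (not necessarily the script's actual implementation; the output name `2018_flat.parquet` is just illustrative):

```python
import pandas as pd

df = pd.read_parquet('2018.parquet')

# Collapse every (level 0, level 1) tuple into a single string, so that,
# e.g., ('ContractFolderStatus', 'ContractFolderID') becomes
# 'ContractFolderStatus - ContractFolderID'.
df.columns = [' - '.join(str(l) for l in levels if l) for levels in df.columns]

df.to_parquet('2018_flat.parquet')
```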
#### Processing a list of zip files
The command `sproc_read_zips.py` can be used to batch-process a sequence of files, e.g.,

```
sproc_read_zips.py contratosMenoresPerfilesContratantes_2018.zip contratosMenoresPerfilesContratantes_2019.zip
```

If no output file is specified (through the `-o` option), an `out.parquet` file is produced, in which the entries of all the zip files are stitched together.
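"Stitched together" here means row-wise concatenation. Had the zip files been processed one at a time, the per-file results could be combined similarly by hand (a sketch with hypothetical file names):

```python
import pandas as pd

# Row-wise concatenation of per-year results (hypothetical file names).
combined = pd.concat(
    [pd.read_parquet('minors_2018.parquet'), pd.read_parquet('minors_2019.parquet')],
    ignore_index=True,
)
combined.to_parquet('out.parquet')
```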
#### Appending new data to an existing (column-multiindexed) parquet file
We can append new data to an existing `pd.DataFrame`. Let us, for instance, download data from January 2022,

```
wget https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_202201.zip
```

and extend the previous parquet file with the data extracted from the newly downloaded zip:

```
sproc_extend_parquet_with_zip.py 2018.parquet PlataformasAgregadasSinMenores_202201.zip 2018_202201.parquet
```

The combined data is saved in `2018_202201.parquet`.
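Conceptually, the extension is again a row-wise concatenation of old and new entries. A rough pandas equivalent (ignoring any deduplication or updating of existing entries the script may perform; `202201.parquet` is a hypothetical file holding the already-parsed new data):

```python
import pandas as pd

old = pd.read_parquet('2018.parquet')
new = pd.read_parquet('202201.parquet')  # hypothetical: the new zip, already parsed

# Append the new entries to the old ones and store the result.
pd.concat([old, new], ignore_index=True).to_parquet('2018_202201.parquet')
```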