core

Main functionality.

Directory where the zip files are stored

import pathlib
import pandas as pd

directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory
Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')

Processing a single zip file

A function to read a single zip file from the command line. The same result can be achieved with cli_read_zips, but cli_read_single_zip is slightly more efficient.


cli_read_single_zip

 cli_read_single_zip (args:list=None)

Parses command-line arguments to read a single zip file exploiting the functionality of sproc.assemble

         Type  Default  Details
args     list  None     Command-line arguments
Returns  None

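
The `args:list=None` signature shared by the `cli_*` functions on this page matches a common argparse pattern. The following is only a minimal sketch of that pattern, not sproc's actual code: the function name `cli_sketch` and the argument names `zip_file` and `output_file` are assumptions made for illustration.

```python
import argparse


def cli_sketch(args: list = None) -> argparse.Namespace:
    """Hypothetical stand-in for a `cli_*` function; not sproc's actual code."""
    parser = argparse.ArgumentParser(description='Read a single zip file')
    # Argument names below are assumed for illustration
    parser.add_argument('zip_file', help='Input zip file')
    parser.add_argument('output_file', help='Output parquet file')
    # With args=None, argparse falls back to sys.argv[1:], so the same
    # function can serve as a console entry point or be called with a list
    return parser.parse_args(args)


parsed = cli_sketch(['samples/yearly/some.zip', 'samples/out.parquet'])
print(parsed.zip_file, parsed.output_file)
```

Passing an explicit list, as the cells below do, makes such a function easy to exercise from a notebook or a test.
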
zip_file = directory / 'yearly' / 'PlataformasAgregadasSinMenores_2018.zip'
assert zip_file.exists()
print(f'{zip_file=}')
zip_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/yearly/PlataformasAgregadasSinMenores_2018.zip')
output_file = directory / 'year_2018.parquet'
print(f'{output_file=}')
output_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/year_2018.parquet')
args = [zip_file.as_posix(), output_file.as_posix()]
cli_read_single_zip(args)
2018-2021_20samples.parquet
extended_sample.parquet
extended_sample_renamed.parquet
gencat/
insiders_sample.parquet
merged.parquet
minors_sample.parquet
PLACE.yaml
PlataformasAgregadasSinMenores_20220104_030016_1.atom
PlataformasAgregadasSinMenores_20220104_030016_1_single.atom
PlataformasAgregadasSinMenores_202201_05-06.zip
PlataformasAgregadasSinMenores_202201_08-11.zip
PlataformasAgregadasSinMenores_202201_28-29.zip
README.md
renamed_cols_extended_sample.parquet
year_2018.parquet
yearly/
pd.read_parquet(output_file).head(2)
id summary title ContractFolderStatus updated ContractFolderStatus deleted_on
ContractFolderID LocatedContractingParty ProcurementProject ... TechnicalDocumentReference LocatedContractingParty TenderingProcess ContractFolderStatusCode
Party Name TypeCode BudgetAmount ... ID Attachment ParentLocatedParty ParticipationRequestReceptionPeriod TenderSubmissionDeadlinePeriod
PartyIdentification PartyName EstimatedOverallContractAmount TaxExclusiveAmount ... ExternalReference ParentLocatedParty EndDate EndTime
ID Name ... URI ParentLocatedParty
... PartyName ParentLocatedParty
... Name PartyName
... Name
zip file name entry
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom 453 https://contrataciondelestado.es/sindicacion/P... Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 L02000047 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 89917.95 ... <NA> <NA> <NA> <NA> 2017-11-02 23:59:00 2017-11-02 23:59:00+00:00 [2018-01-02 08:01:52.024000+00:00] [RES] NaT
452 https://contrataciondelestado.es/sindicacion/P... Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 L02000047 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 175708.46 ... <NA> <NA> <NA> <NA> 2017-11-02 23:59:00 2017-11-02 23:59:00+00:00 [2018-01-02 08:02:24.833000+00:00] [RES] NaT

2 rows × 41 columns

Extending historical data

A function to extend an existing parquet file with new data in a zip file.


cli_extend_parquet_with_zip

 cli_extend_parquet_with_zip (args:list=None)

Parses command-line arguments to be passed to sproc.extend.parquet_with_zip

         Type  Default  Details
args     list  None     Command-line arguments
Returns  None
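
How sproc.extend.parquet_with_zip combines old and new data is not shown on this page. As a rough illustration of what extending a history can involve, here is a sketch of one plausible policy, "last update wins per id", written with plain dictionaries. The function name `extend_history` and the record fields `id` and `updated` are assumptions, and the real library may follow a different policy.

```python
def extend_history(history: list, new: list) -> list:
    """Hypothetical merge: keep, for every id, the most recently updated record."""
    latest = {}
    for record in history + new:
        current = latest.get(record['id'])
        # ISO-formatted dates compare correctly as plain strings
        if current is None or record['updated'] >= current['updated']:
            latest[record['id']] = record
    return list(latest.values())


history = [
    {'id': 'a', 'updated': '2021-01-01'},
    {'id': 'b', 'updated': '2021-06-01'},
]
new = [
    {'id': 'b', 'updated': '2022-01-28'},
    {'id': 'c', 'updated': '2022-01-29'},
]
extended = extend_history(history, new)
print([(r['id'], r['updated']) for r in extended])
```
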

Testing with some sample files

history_file = directory / '2018-2021_20samples.parquet'
assert history_file.exists()
print(f'{history_file=}')
history_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/2018-2021_20samples.parquet')
new_zip_file = directory / 'PlataformasAgregadasSinMenores_202201_28-29.zip'
assert new_zip_file.exists()
print(f'{new_zip_file=}')
new_zip_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_202201_28-29.zip')
output_file = directory / 'extended_sample.parquet'
output_file
Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/extended_sample.parquet')
args = [history_file.as_posix(), new_zip_file.as_posix(), output_file.as_posix()]
cli_extend_parquet_with_zip(args)
2018-2021_20samples.parquet
extended_sample.parquet
extended_sample_renamed.parquet
gencat/
insiders_sample.parquet
merged.parquet
minors_sample.parquet
PLACE.yaml
PlataformasAgregadasSinMenores_20220104_030016_1.atom
PlataformasAgregadasSinMenores_20220104_030016_1_single.atom
PlataformasAgregadasSinMenores_202201_05-06.zip
PlataformasAgregadasSinMenores_202201_08-11.zip
PlataformasAgregadasSinMenores_202201_28-29.zip
README.md
renamed_cols_extended_sample.parquet
year_2018.parquet
yearly/

Renaming columns

A function to flatten a hierarchical (column-multiindex) pd.DataFrame using a given naming scheme or a default one.


cli_rename_columns

 cli_rename_columns (args:list=None)

Parses command-line arguments to be passed to sproc.hier.flatten_columns_names

         Type  Default  Details
args     list  None     Command-line arguments
Returns  Path           Output file
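
The idea of flattening hierarchical column names can be sketched without pandas by treating every column as a tuple of levels. This is not the implementation of sproc.hier.flatten_columns_names: the function name `flatten_names`, the mapping format (full tuple to friendly name) and the sample tuples are assumptions, while the `' - '` separator mirrors the unmapped names visible in the output further down this page.

```python
def flatten_names(columns, mapping=None, sep=' - '):
    """Turn tuples of column levels into flat names.

    `mapping` (hypothetical format) maps a full tuple to a friendly name;
    any other column falls back to its non-empty levels joined by `sep`.
    """
    mapping = mapping or {}
    flat = []
    for levels in columns:
        levels = tuple(levels)
        if levels in mapping:
            flat.append(mapping[levels])
        else:
            flat.append(sep.join(str(level) for level in levels if level))
    return flat


columns = [
    ('id', '', ''),
    ('ContractFolderStatus', 'ContractFolderID', ''),
    ('ContractFolderStatus', 'LocatedContractingParty', 'Party'),
]
mapping = {('ContractFolderStatus', 'ContractFolderID', ''): 'Número de Expediente'}
flat_names = flatten_names(columns, mapping)
print(flat_names)
```

With pandas, the resulting list could be assigned to `df.columns` after obtaining the tuples from `df.columns.to_flat_index()`.
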

A local file encompassing a name mapping

mapping_file = directory / 'PLACE.yaml'
assert mapping_file.exists()
mapping_file
Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PLACE.yaml')

…is used to rename the columns

args = [output_file.as_posix(), '--from-local-file', mapping_file.as_posix()]
renamed_cols_output_file = cli_rename_columns(args)
renamed_cols_output_file_df = pd.read_parquet(renamed_cols_output_file).head(2)
renamed_cols_output_file_df
id summary title Número de Expediente Nombre Objeto del Contrato Tipo de Contrato Valor estimado del contrato Presupuesto base sin impuestos Clasificación CPV ... ID Lote ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name Presentación de Oferta ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name Presentación de Oferta (Observaciones) URL perfil de contratante deleted_on updated Estado
zip file name entry
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom 453 https://contrataciondelestado.es/sindicacion/P... Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 89917.95 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:01:52.024000+00:00] [RES]
452 https://contrataciondelestado.es/sindicacion/P... Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 175708.46 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:02:24.833000+00:00] [RES]

2 rows × 43 columns

The columns are no longer multi-indexed

renamed_cols_output_file_df.columns[:5]
Index(['id', 'summary', 'title', 'Número de Expediente', 'Nombre'], dtype='object')

A remote naming scheme stored in the repository can also be used

args = [output_file.as_posix(), '--from-repository-file', 'outsiders.yaml']
renamed_cols_output_file = cli_rename_columns(args)
pd.read_parquet(renamed_cols_output_file).head(2)

Reading a bunch of zip files

read_zips receives a list of zip files and returns a single (column-hierarchical) pd.DataFrame encompassing all the data.


read_zips

 read_zips (files:list[str|pathlib.Path])

Build a DataFrame out of a bunch of zip files

         Type       Details
files    list       Input files
Returns  DataFrame  Procurement data
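
The DataFrame shown earlier on this page is indexed by zip, file name and entry, which suggests that every row is tagged with the archive and the .atom file it came from. The sketch below covers only that bookkeeping step using the standard library. The function `list_atom_entries` and the demo archives are made up for illustration, and the actual parsing done by read_zips is not reproduced.

```python
import pathlib
import tempfile
import zipfile


def list_atom_entries(files):
    """Return (zip name, entry name) pairs for every .atom file in `files`."""
    pairs = []
    for file in files:
        file = pathlib.Path(file)
        with zipfile.ZipFile(file) as archive:
            for name in sorted(archive.namelist()):
                if name.endswith('.atom'):
                    pairs.append((file.name, name))
    return pairs


# Self-contained demo with two tiny, made-up archives
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    demo_files = []
    for year in (2018, 2019):
        path = tmp / f'demo_{year}.zip'
        with zipfile.ZipFile(path, 'w') as archive:
            archive.writestr(f'demo_{year}_1.atom', '<feed/>')
            archive.writestr('notes.txt', 'ignored')
        demo_files.append(path)
    entries = list_atom_entries(demo_files)

print(entries)
```
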

Let us pick a couple of files for testing

zip_files = ['PlataformasAgregadasSinMenores_2018.zip', 'PlataformasAgregadasSinMenores_2019.zip']
zip_files = [directory / 'yearly' / e for e in zip_files]
zip_files
df = read_zips(zip_files)
df.head()

CLI

A companion function that exposes read_zips on the command line.


cli_read_zips

 cli_read_zips (args:list=None)

Parses command-line arguments to be passed to read_zips

         Type  Default  Details
args     list  None     Command-line arguments
Returns  None

cli_read_zips([e.as_posix() for e in zip_files] + '-o o.parquet'.split())
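
In the call above, `'-o o.parquet'.split()` is simply a compact way of writing `['-o', 'o.parquet']`. A parser accepting any number of input files plus an output option could look like the sketch below. Only the `-o` flag appears in the original call, so the long option name `--output_file`, its default value and the demo file names are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(description='Read several zip files')
# nargs='+' collects one or more positional values into a list
parser.add_argument('zip_files', nargs='+', help='Input zip files')
# Long option name and default are hypothetical
parser.add_argument('-o', '--output_file', default='out.parquet', help='Output parquet file')

demo_args = ['2018.zip', '2019.zip'] + '-o o.parquet'.split()
parsed_many = parser.parse_args(demo_args)
print(parsed_many.zip_files, parsed_many.output_file)
```
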

Downloading new data

Core function to download new data and update existing local structures.


dl

 dl (kind:str, output_directory:str|pathlib.Path)

Download data or update local one

                  Type                Details
kind              str                 One of ‘outsiders’, ‘insiders’, or ‘minors’
output_directory  str | pathlib.Path  The path where data is to be stored
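
Since `kind` is restricted to three values and `output_directory` accepts either a string or a path, a caller may want to check both before starting a download. The helper below is a hypothetical pre-flight check, not part of sproc, and says nothing about how dl validates its own arguments.

```python
import pathlib
import tempfile

KINDS = ('outsiders', 'insiders', 'minors')


def prepare_download(kind: str, output_directory) -> pathlib.Path:
    """Hypothetical pre-flight check before downloading anything."""
    if kind not in KINDS:
        raise ValueError(f'kind must be one of {KINDS}, got {kind!r}')
    # Accept both str and pathlib.Path, and create the directory if missing
    output_directory = pathlib.Path(output_directory)
    output_directory.mkdir(parents=True, exist_ok=True)
    return output_directory


with tempfile.TemporaryDirectory() as tmp:
    target = prepare_download('outsiders', pathlib.Path(tmp) / 'data' / 'plataforma')
    created = target.is_dir()

print(created)
```
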

CLI

A companion function that exposes dl on the command line.


cli_dl

 cli_dl (args:list=None)

Parses command-line arguments to be passed to dl

         Type  Default  Details
args     list  None     Command-line arguments
Returns  None

# output_directory = pathlib.Path.cwd().parent / 'data' / 'plataforma'
# args = ['outsiders', '-o', output_directory.as_posix()]
# cli_dl(args)