core

Main functionality.

Directory where the zip files are stored

import pathlib
import pandas as pd

directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory

Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')

Processing a single zip file

A function to read a single zip file from the command line. The same result can be achieved with cli_read_zips, but this function is slightly more efficient.

cli_read_single_zip

 cli_read_single_zip (args:list=None)

Parses command-line arguments to read a single zip file exploiting the functionality of sproc.assemble

	Type	Default	Details
args	list	None	Command-line arguments
Returns	None
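
The `args:list=None` signature suggests the usual argparse convention: an explicit list is parsed as given, whereas `None` makes argparse fall back to `sys.argv`. That is what lets the function be called both from a shell and from a notebook cell. The sketch below only illustrates that pattern; the function `parse_single_zip_args` and the argument names `zip_file` and `output_file` are assumptions made for the example, not the actual parser in sproc.

```python
import argparse
import pathlib
from typing import Optional


def parse_single_zip_args(args: Optional[list] = None) -> argparse.Namespace:
    # Hypothetical parser: the argument names are assumptions, not sproc's own
    parser = argparse.ArgumentParser(description='Read a single zip file')
    parser.add_argument('zip_file', type=pathlib.Path, help='Input zip file')
    parser.add_argument('output_file', type=pathlib.Path, help='Output parquet file')
    # When args is None, argparse reads sys.argv[1:] instead
    return parser.parse_args(args)


# Calling with an explicit list, as one would do from a notebook
parsed = parse_single_zip_args(['yearly/data.zip', 'out.parquet'])
print(parsed.zip_file, parsed.output_file)
```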

zip_file = directory / 'yearly' / 'PlataformasAgregadasSinMenores_2018.zip'
assert zip_file.exists()
print(f'{zip_file=}')

zip_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/yearly/PlataformasAgregadasSinMenores_2018.zip')

output_file = directory / 'year_2018.parquet'
print(f'{output_file=}')

output_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/year_2018.parquet')

args = [zip_file.as_posix(), output_file.as_posix()]
cli_read_single_zip(args)

2018-2021_20samples.parquet
extended_sample.parquet
extended_sample_renamed.parquet
gencat/
insiders_sample.parquet
merged.parquet
minors_sample.parquet
PLACE.yaml
PlataformasAgregadasSinMenores_20220104_030016_1.atom
PlataformasAgregadasSinMenores_20220104_030016_1_single.atom
PlataformasAgregadasSinMenores_202201_05-06.zip
PlataformasAgregadasSinMenores_202201_08-11.zip
PlataformasAgregadasSinMenores_202201_28-29.zip
README.md
renamed_cols_extended_sample.parquet
year_2018.parquet
yearly/

pd.read_parquet(output_file).head(2)

			id	summary	title	ContractFolderStatus															updated	ContractFolderStatus	deleted_on
						ContractFolderID	LocatedContractingParty		ProcurementProject				...	TechnicalDocumentReference		LocatedContractingParty		TenderingProcess				ContractFolderStatusCode
							Party		Name	TypeCode	BudgetAmount		...	ID	Attachment	ParentLocatedParty		ParticipationRequestReceptionPeriod		TenderSubmissionDeadlinePeriod
							PartyIdentification	PartyName			EstimatedOverallContractAmount	TaxExclusiveAmount	...		ExternalReference	ParentLocatedParty		EndDate	EndTime
							ID	Name					...		URI	ParentLocatedParty
													...			PartyName	ParentLocatedParty
													...			Name	PartyName
													...				Name
zip	file name	entry
PlataformasAgregadasSinMenores_2018.zip	PlataformasAgregadasSinMenores_20180217_180137_1.atom	453	https://contrataciondelestado.es/sindicacion/P...	Expediente: 1284/17, Entidad: Diputación Provi...	Refuerzo de Firme en la VP 3001 Renedo de Esgu...	1284/17	L02000047	Diputación Provincial de Valladolid	Refuerzo de Firme en la VP 3001 Renedo de Esgu...	3.0	89917.95	89917.95	...	<NA>	<NA>	<NA>	<NA>	2017-11-02	23:59:00	2017-11-02 23:59:00+00:00	[2018-01-02 08:01:52.024000+00:00]	[RES]	NaT
PlataformasAgregadasSinMenores_2018.zip	PlataformasAgregadasSinMenores_20180217_180137_1.atom	452	https://contrataciondelestado.es/sindicacion/P...	Expediente: 1282/17, Entidad: Diputación Provi...	Refuerzo de Firme en la VP 6603 Mota del Marqu...	1282/17	L02000047	Diputación Provincial de Valladolid	Refuerzo de Firme en la VP 6603 Mota del Marqu...	3.0	175708.46	175708.46	...	<NA>	<NA>	<NA>	<NA>	2017-11-02	23:59:00	2017-11-02 23:59:00+00:00	[2018-01-02 08:02:24.833000+00:00]	[RES]	NaT

2 rows × 41 columns

Extending historical data

A function to extend an existing parquet file with new data in a zip file.

cli_extend_parquet_with_zip

 cli_extend_parquet_with_zip (args:list=None)

Parses command-line arguments to be passed to sproc.extend.parquet_with_zip

	Type	Default	Details
args	list	None	Command-line arguments
Returns	None
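
One way to picture the extension step is as a concatenation of the historical table with the newly read one, followed by dropping the rows present in both. The sketch below is only that: it works on toy data with plain pandas, and both the function `extend_history` and the choice of `id` plus `updated` as the deduplication key are assumptions. The actual logic lives in sproc.extend.parquet_with_zip and may well differ.

```python
import pandas as pd


def extend_history(history: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    # Stack both tables; rows coming from `new` end up last
    combined = pd.concat([history, new], ignore_index=True)
    # Assumption: a row is identified by (id, updated); keep the latest copy
    deduplicated = combined.drop_duplicates(subset=['id', 'updated'], keep='last')
    return deduplicated.reset_index(drop=True)


# Toy data standing in for the historical parquet file and the new zip file
history = pd.DataFrame(
    {'id': ['a', 'b'], 'updated': ['2021-12-30', '2021-12-31'], 'title': ['A', 'B']}
)
new = pd.DataFrame(
    {'id': ['b', 'c'], 'updated': ['2021-12-31', '2022-01-28'], 'title': ['B', 'C']}
)

extended = extend_history(history, new)
print(extended)
```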

Testing with some sample files

history_file = directory / '2018-2021_20samples.parquet'
assert history_file.exists()
print(f'{history_file=}')

history_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/2018-2021_20samples.parquet')

new_zip_file = directory / 'PlataformasAgregadasSinMenores_202201_28-29.zip'
assert new_zip_file.exists()
print(f'{new_zip_file=}')

new_zip_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_202201_28-29.zip')

output_file = directory / 'extended_sample.parquet'
output_file

Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/extended_sample.parquet')

args = [history_file.as_posix(), new_zip_file.as_posix(), output_file.as_posix()]
cli_extend_parquet_with_zip(args)

2018-2021_20samples.parquet
extended_sample.parquet
extended_sample_renamed.parquet
gencat/
insiders_sample.parquet
merged.parquet
minors_sample.parquet
PLACE.yaml
PlataformasAgregadasSinMenores_20220104_030016_1.atom
PlataformasAgregadasSinMenores_20220104_030016_1_single.atom
PlataformasAgregadasSinMenores_202201_05-06.zip
PlataformasAgregadasSinMenores_202201_08-11.zip
PlataformasAgregadasSinMenores_202201_28-29.zip
README.md
renamed_cols_extended_sample.parquet
year_2018.parquet
yearly/

Renaming columns

A function to flatten a hierarchical (column-multiindex) pd.DataFrame using a given naming scheme or a default one.

cli_rename_columns

 cli_rename_columns (args:list=None)

Parses command-line arguments to be passed to sproc.hier.flatten_columns_names

	Type	Default	Details
args	list	None	Command-line arguments
Returns	Path		Output file
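
To make the flattening concrete, here is a minimal sketch in plain pandas. It joins the non-empty levels of each column with ` - ` (the separator visible in the unmapped column names of the output further down) and then applies a name mapping where one exists. The helper `flatten` and the single mapping entry are hypothetical and chosen for illustration; they are not the contents of PLACE.yaml or the implementation of sproc.hier.flatten_columns_names.

```python
import pandas as pd

# Toy frame with two-level columns, mimicking the hierarchical layout
columns = pd.MultiIndex.from_tuples([
    ('id', ''),
    ('ContractFolderStatus', 'ContractFolderID'),
    ('ContractFolderStatus', 'ContractFolderStatusCode'),
])
df = pd.DataFrame([['x', '1284/17', 'RES']], columns=columns)

# Hypothetical mapping from the joined name to a friendlier one
mapping = {'ContractFolderStatus - ContractFolderID': 'Número de Expediente'}


def flatten(df: pd.DataFrame, mapping: dict, sep: str = ' - ') -> pd.DataFrame:
    # Join the non-empty levels of every column into a single string
    joined = [sep.join(level for level in col if level) for col in df.columns]
    flat = df.copy()
    # Use the mapped name when available, the joined name otherwise
    flat.columns = [mapping.get(name, name) for name in joined]
    return flat


flat = flatten(df, mapping)
print(list(flat.columns))
```

Columns missing from the mapping keep their joined name, which matches the mix of renamed and ` - `-joined columns seen when the real mapping file is applied.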

A local file encompassing a name mapping…

mapping_file = directory / 'PLACE.yaml'
assert mapping_file.exists()
mapping_file

Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PLACE.yaml')

…is used to rename the columns

args = [output_file.as_posix(), '--from-local-file', mapping_file.as_posix()]
renamed_cols_output_file = cli_rename_columns(args)
renamed_cols_output_file_df = pd.read_parquet(renamed_cols_output_file).head(2)
renamed_cols_output_file_df

			id	summary	title	Número de Expediente	Nombre	Objeto del Contrato	Tipo de Contrato	Valor estimado del contrato	Presupuesto base sin impuestos	Clasificación CPV	...	ID	Lote	ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name	Presentación de Oferta	ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name	Presentación de Oferta (Observaciones)	URL perfil de contratante	deleted_on	updated	Estado
zip	file name	entry
PlataformasAgregadasSinMenores_2018.zip	PlataformasAgregadasSinMenores_20180217_180137_1.atom	453	https://contrataciondelestado.es/sindicacion/P...	Expediente: 1284/17, Entidad: Diputación Provi...	Refuerzo de Firme en la VP 3001 Renedo de Esgu...	1284/17	Diputación Provincial de Valladolid	Refuerzo de Firme en la VP 3001 Renedo de Esgu...	3.0	89917.95	89917.95	[45233142.0]	...	L02000047	[1.0]	<NA>	2017-11-02 23:59:00+00:00	<NA>	<NA>	<NA>	NaT	[2018-01-02 08:01:52.024000+00:00]	[RES]
PlataformasAgregadasSinMenores_2018.zip	PlataformasAgregadasSinMenores_20180217_180137_1.atom	452	https://contrataciondelestado.es/sindicacion/P...	Expediente: 1282/17, Entidad: Diputación Provi...	Refuerzo de Firme en la VP 6603 Mota del Marqu...	1282/17	Diputación Provincial de Valladolid	Refuerzo de Firme en la VP 6603 Mota del Marqu...	3.0	175708.46	175708.46	[45233142.0]	...	L02000047	[1.0]	<NA>	2017-11-02 23:59:00+00:00	<NA>	<NA>	<NA>	NaT	[2018-01-02 08:02:24.833000+00:00]	[RES]

2 rows × 43 columns

The columns are no longer multiindexed

renamed_cols_output_file_df.columns[:5]

Index(['id', 'summary', 'title', 'Número de Expediente', 'Nombre'], dtype='object')

A remote naming scheme stored in the repository can also be used

args = [output_file.as_posix(), '--from-repository-file', 'outsiders.yaml']
renamed_cols_output_file = cli_rename_columns(args)
pd.read_parquet(renamed_cols_output_file).head(2)

Reading a bunch of zip files

read_zips receives a list of zip files and returns a (column-hierarchical) pd.DataFrame encompassing all the data.

read_zips

 read_zips (files:list[str|pathlib.Path])

Build a DataFrame out of a bunch of zip files

	Type	Details
files	list	Input files
Returns	DataFrame	Procurement data
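
The index of the tables shown earlier (zip, file name, entry) indicates that each zip file bundles several `.atom` files. As a rough sketch of the first step such a reader has to perform, the snippet below lists the `.atom` members of an archive using only the standard library. The helper `list_atom_entries` and the toy archive are assumptions for illustration; parsing the feeds and assembling the DataFrame is left to sproc.

```python
import io
import zipfile


def list_atom_entries(zip_bytes: bytes) -> list[str]:
    # Open the archive from memory and keep only the .atom members, sorted by name
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(name for name in archive.namelist() if name.endswith('.atom'))


# Build a toy archive in memory so that the example is self-contained
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('sample_2.atom', '<feed/>')
    archive.writestr('sample_1.atom', '<feed/>')
    archive.writestr('README.txt', 'not a feed')

atom_files = list_atom_entries(buffer.getvalue())
print(atom_files)
```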

Let us pick a couple of files for testing

zip_files = ['PlataformasAgregadasSinMenores_2018.zip', 'PlataformasAgregadasSinMenores_2019.zip']
zip_files = [directory / 'yearly' / e for e in zip_files]
zip_files

df = read_zips(zip_files)
df.head()

CLI

A companion function to allow using the above from the command line.

cli_read_zips

 cli_read_zips (args:list=None)

Parses command-line arguments to be passed to read_zips

	Type	Default	Details
args	list	None	Command-line arguments
Returns	None

cli_read_zips([e.as_posix() for e in zip_files] + '-o o.parquet'.split())

Downloading new data

Core function to download new data and update existing local structures.

dl

 dl (kind:str, output_directory:str|pathlib.Path)

Download data or update the local copy

	Type	Details
kind	str	One of ‘outsiders’, ‘insiders’, or ‘minors’
output_directory	str | pathlib.Path	The path where data is to be stored

CLI

A companion function to allow using the above from the command line.

cli_dl

 cli_dl (args:list=None)

Parses command-line arguments to be passed to dl

	Type	Default	Details
args	list	None	Command-line arguments
Returns	None
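
Since `kind` must be one of three values, a command-line wrapper can delegate that check to argparse through `choices`. The following is a sketch under assumptions: the function `parse_dl_args` and the long option name `--output_directory` are made up for the example (only `-o` appears in this page), and the real cli_dl may be organized differently.

```python
import argparse
import pathlib

# The three kinds of data listed in the documentation of dl
KINDS = ('outsiders', 'insiders', 'minors')


def parse_dl_args(args=None) -> argparse.Namespace:
    # Hypothetical parser; only the -o flag and the kinds come from this page
    parser = argparse.ArgumentParser(description='Download data or update the local copy')
    # argparse rejects any value outside KINDS with a usage error
    parser.add_argument('kind', choices=KINDS, help='Kind of data to download')
    parser.add_argument(
        '-o', '--output_directory', type=pathlib.Path, required=True,
        help='The path where data is to be stored')
    return parser.parse_args(args)


parsed = parse_dl_args(['outsiders', '-o', 'data/plataforma'])
print(parsed.kind, parsed.output_directory)
```

With this setup an unsupported kind never reaches the download code: argparse prints a usage message and exits.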

# output_directory = pathlib.Path.cwd().parent / 'data' / 'plataforma'
# args = ['outsiders', '-o', output_directory.as_posix()]
# cli_dl(args)