directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory
Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')
Directory where the zip files are stored
Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')
A function to read a single zip file from the command line. One could also achieve the same result with cli_read_zips, though this is slightly more efficient.
cli_read_single_zip (args:list=None)
Parses command-line arguments to read a single zip file exploiting the functionality of sproc.assemble
| | Type | Default | Details |
|---|---|---|---|
| args | list | None | Command-line arguments |
| Returns | None | | |
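The `args:list=None` signature matches a common convention for command-line entry points: with `args=None` the parser reads `sys.argv`, while an explicit list lets the same function be called from a notebook or a test. The following is a minimal sketch of that convention, assuming an `argparse`-based parser. The function name `cli_sketch` and the arguments `zip_file` and `output_file` are hypothetical and are not taken from sproc.

```python
import argparse
import pathlib


def cli_sketch(args: list = None) -> argparse.Namespace:
    # Hypothetical parser: these argument names are illustrative, not sproc's actual CLI
    parser = argparse.ArgumentParser(description='Read a single zip file')
    parser.add_argument('zip_file', type=pathlib.Path, help='Input zip file')
    parser.add_argument('output_file', type=pathlib.Path, help='Output parquet file')

    # With args=None, argparse reads sys.argv[1:]; with a list, it parses that list
    return parser.parse_args(args)


parsed = cli_sketch(['yearly/PlataformasAgregadasSinMenores_2018.zip', 'year_2018.parquet'])
print(parsed.zip_file.name, parsed.output_file.suffix)
```

Passing the arguments as a list of strings is what makes it possible to exercise such a function inside a notebook cell.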
zip_file = directory / 'yearly' / 'PlataformasAgregadasSinMenores_2018.zip'
assert zip_file.exists()
print(f'{zip_file=}')
zip_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/yearly/PlataformasAgregadasSinMenores_2018.zip')
output_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/year_2018.parquet')
2018-2021_20samples.parquet
extended_sample.parquet
extended_sample_renamed.parquet
gencat/
insiders_sample.parquet
merged.parquet
minors_sample.parquet
PLACE.yaml
PlataformasAgregadasSinMenores_20220104_030016_1.atom
PlataformasAgregadasSinMenores_20220104_030016_1_single.atom
PlataformasAgregadasSinMenores_202201_05-06.zip
PlataformasAgregadasSinMenores_202201_08-11.zip
PlataformasAgregadasSinMenores_202201_28-29.zip
README.md
renamed_cols_extended_sample.parquet
year_2018.parquet
yearly/
id | summary | title | ContractFolderStatus | updated | ContractFolderStatus | deleted_on | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | LocatedContractingParty | ProcurementProject | ... | TechnicalDocumentReference | LocatedContractingParty | TenderingProcess | ContractFolderStatusCode | ||||||||||||||||
Party | Name | TypeCode | BudgetAmount | ... | ID | Attachment | ParentLocatedParty | ParticipationRequestReceptionPeriod | TenderSubmissionDeadlinePeriod | ||||||||||||||
PartyIdentification | PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ... | ExternalReference | ParentLocatedParty | EndDate | EndTime | |||||||||||||||
ID | Name | ... | URI | ParentLocatedParty | |||||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||||
... | Name | ||||||||||||||||||||||
zip | file name | entry | |||||||||||||||||||||
PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | L02000047 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | ... | <NA> | <NA> | <NA> | <NA> | 2017-11-02 | 23:59:00 | 2017-11-02 23:59:00+00:00 | [2018-01-02 08:01:52.024000+00:00] | [RES] | NaT |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | L02000047 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | ... | <NA> | <NA> | <NA> | <NA> | 2017-11-02 | 23:59:00 | 2017-11-02 23:59:00+00:00 | [2018-01-02 08:02:24.833000+00:00] | [RES] | NaT |
2 rows × 41 columns
A function to extend an existing parquet file with new data in a zip file.
cli_extend_parquet_with_zip (args:list=None)
Parses command-line arguments to be passed to sproc.extend.parquet_with_zip
| | Type | Default | Details |
|---|---|---|---|
| args | list | None | Command-line arguments |
| Returns | None | | |
Testing with some sample files
history_file = directory / '2018-2021_20samples.parquet'
assert history_file.exists()
print(f'{history_file=}')
history_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/2018-2021_20samples.parquet')
new_zip_file = directory / 'PlataformasAgregadasSinMenores_202201_28-29.zip'
assert new_zip_file.exists()
print(f'{new_zip_file=}')
new_zip_file=Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_202201_28-29.zip')
Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/extended_sample.parquet')
2018-2021_20samples.parquet
extended_sample.parquet
extended_sample_renamed.parquet
gencat/
insiders_sample.parquet
merged.parquet
minors_sample.parquet
PLACE.yaml
PlataformasAgregadasSinMenores_20220104_030016_1.atom
PlataformasAgregadasSinMenores_20220104_030016_1_single.atom
PlataformasAgregadasSinMenores_202201_05-06.zip
PlataformasAgregadasSinMenores_202201_08-11.zip
PlataformasAgregadasSinMenores_202201_28-29.zip
README.md
renamed_cols_extended_sample.parquet
year_2018.parquet
yearly/
A function to flatten a hierarchical (column-multiindex) pd.DataFrame using a given naming scheme or a default one.
cli_rename_columns (args:list=None)
Parses command-line arguments to be passed to sproc.hier.flatten_columns_names
| | Type | Default | Details |
|---|---|---|---|
| args | list | None | Command-line arguments |
| Returns | Path | | Output file |
A local file encompassing a name mapping…
Path('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PLACE.yaml')
…is used to rename the columns
args = [output_file.as_posix(), '--from-local-file', mapping_file.as_posix()]
renamed_cols_output_file = cli_rename_columns(args)
renamed_cols_output_file_df = pd.read_parquet(renamed_cols_output_file).head(2)
renamed_cols_output_file_df
id | summary | title | Número de Expediente | Nombre | Objeto del Contrato | Tipo de Contrato | Valor estimado del contrato | Presupuesto base sin impuestos | Clasificación CPV | ... | ID | Lote | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | Presentación de Oferta | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | Presentación de Oferta (Observaciones) | URL perfil de contratante | deleted_on | updated | Estado | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
zip | file name | entry | |||||||||||||||||||||
PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] |
2 rows × 43 columns
Columns are not multiindexed anymore
Index(['id', 'summary', 'title', 'Número de Expediente', 'Nombre'], dtype='object')
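The flattening step can be sketched in plain Python, since the labels of a column multiindex are tuples with one item per level. This is only an illustration under stated assumptions, not the code in sproc.hier: the helper `flatten_column_names`, the shape of the `mapping` dictionary, and the choice to key it by the joined name are all hypothetical. The `' - '` separator mirrors the unmapped column names shown in the output above.

```python
def flatten_column_names(columns, mapping=None, sep=' - '):
    """Turn hierarchical column labels (tuples) into flat strings."""
    mapping = mapping or {}
    flat = []
    for levels in columns:
        # Drop empty levels, which pad shallow columns in a multi-level header
        default_name = sep.join(level for level in levels if level)
        # Use the mapped name when the scheme has one, else the joined default
        flat.append(mapping.get(default_name, default_name))
    return flat


columns = [
    ('id', '', ''),
    ('ContractFolderStatus', 'ContractFolderID', ''),
    ('ContractFolderStatus', 'LocatedContractingParty', 'Party'),
]
# Hypothetical naming scheme, keyed by the joined default name
mapping = {'ContractFolderStatus - ContractFolderID': 'Número de Expediente'}

print(flatten_column_names(columns, mapping))
```

Columns absent from the mapping keep their joined hierarchical name, which would explain why a few long `ContractFolderStatus - ...` labels survive in the renamed table.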
A remote naming scheme stored in the repository can also be used
The following function receives a list of zip files and returns a (column-hierarchical) pd.DataFrame encompassing all the data
read_zips (files:list[str|pathlib.Path])
Build a DataFrame out of a bunch of zip files
| | Type | Details |
|---|---|---|
| files | list | Input files |
| Returns | DataFrame | Procurement data |
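To illustrate the kind of traversal this implies, the sketch below walks several zip archives with the standard `zipfile` module and records one row per `.atom` entry, keyed by zip name and file name, much like the first two index levels of the tables above. It does not parse the feeds and is not the implementation of read_zips. The helpers `list_atom_entries` and `make_zip`, as well as the sample file names, are hypothetical.

```python
import io
import zipfile


def list_atom_entries(files):
    """Return (zip name, entry name, size in bytes) for every .atom entry."""
    rows = []
    for name, file in files:
        with zipfile.ZipFile(file) as archive:
            for entry in archive.namelist():
                if entry.endswith('.atom'):
                    rows.append((name, entry, len(archive.read(entry))))
    return rows


def make_zip(entries):
    # Build a small in-memory zip so the example needs no files on disk
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        for entry, content in entries.items():
            archive.writestr(entry, content)
    buffer.seek(0)
    return buffer


files = [
    ('first.zip', make_zip({'a_1.atom': '<feed/>', 'notes.txt': 'skip me'})),
    ('second.zip', make_zip({'b_1.atom': '<feed></feed>'})),
]
rows = list_atom_entries(files)
print(rows)
```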
Let us pick a couple of files for testing
A companion function to allow using the above from the command-line.
cli_read_zips (args:list=None)
Parses command-line arguments to be passed to read_zips
| | Type | Default | Details |
|---|---|---|---|
| args | list | None | Command-line arguments |
| Returns | None | | |
Core function to download new data and update existing local structures.
dl (kind:str, output_directory:str|pathlib.Path)
Download data or update local one
| | Type | Details |
|---|---|---|
| kind | str | One of ‘outsiders’, ‘insiders’, or ‘minors’ |
| output_directory | str \| pathlib.Path | The path where data is to be stored |
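The parameter table suggests two checks that a caller may want to make before downloading: that `kind` is one of the three documented values, and that `output_directory` works whether it is given as a `str` or a `pathlib.Path`. The sketch below shows one way to do both. The helper `prepare_download` and the constant `KINDS` are hypothetical, and whether dl itself validates its input or creates the directory is an assumption not confirmed here.

```python
import pathlib
import tempfile

KINDS = ('outsiders', 'insiders', 'minors')


def prepare_download(kind: str, output_directory) -> pathlib.Path:
    # Reject anything other than the three documented kinds
    if kind not in KINDS:
        raise ValueError(f'kind must be one of {KINDS}, got {kind!r}')
    # Accept both str and pathlib.Path, as the signature allows
    output_directory = pathlib.Path(output_directory)
    output_directory.mkdir(parents=True, exist_ok=True)
    return output_directory


with tempfile.TemporaryDirectory() as tmp:
    target = prepare_download('minors', tmp + '/data')
    print(target.is_dir())
```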
A companion function to allow using the above from the command-line.
cli_dl (args:list=None)
Parses command-line arguments to be passed to dl
| | Type | Default | Details |
|---|---|---|---|
| args | list | None | Command-line arguments |
| Returns | None | | |