extend

Extend an already existing pd.DataFrame with new data

These imports are not actually needed by the final exported code

import itertools

Historical data

Directory from which to read historical data

history_directory = pathlib.Path.cwd().parent / 'samples'
assert history_directory.exists()
print(history_directory)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples

Input directory for the pd.DataFrame to be updated

history_input_file = history_directory / '2018-2021_20samples.parquet'
assert history_input_file.exists()
print(history_input_file)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/2018-2021_20samples.parquet

The specific file is

history_df = pd.read_parquet(history_input_file)
history_df.head(2)
id summary title ContractFolderStatus updated ContractFolderStatus deleted_on
ContractFolderID LocatedContractingParty ProcurementProject ... LocatedContractingParty TenderResult LocatedContractingParty TenderingProcess LocatedContractingParty TenderingProcess LocatedContractingParty ContractFolderStatusCode
Party Name TypeCode BudgetAmount RequiredCommodityClassification ... Party AwardedTenderedProject ParentLocatedParty TenderSubmissionDeadlinePeriod ParentLocatedParty TenderSubmissionDeadlinePeriod BuyerProfileURIID
PartyName EstimatedOverallContractAmount TaxExclusiveAmount ItemClassificationCode ... PartyIdentification ProcurementProjectLotID ParentLocatedParty ParentLocatedParty Description
Name ... ID ParentLocatedParty ParentLocatedParty
... PartyName ParentLocatedParty
... Name PartyName
... Name
zip file name entry
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom 453 https://contrataciondelestado.es/sindicacion/P... Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 89917.95 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> [2018-01-02 08:01:52.024000+00:00] [RES] NaT
452 https://contrataciondelestado.es/sindicacion/P... Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 175708.46 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> [2018-01-02 08:02:24.833000+00:00] [RES] NaT

2 rows × 43 columns

New data

The directory for the zip file whose data is to appended to the already existing pd.DataFrame

zip_directory = history_directory
# zip_directory = pathlib.Path.cwd() / 'data' / 'agregados'
assert zip_directory.exists()
print(zip_directory)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples

The particular file

zip_file = zip_directory / 'PlataformasAgregadasSinMenores_202201_05-06.zip'
assert zip_file.exists()
print(zip_file)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_202201_05-06.zip

The file is read and processed in the usual way to get a pd.DataFrame

new_df = sproc.bundle.read_zip(zip_file, concatenate=True)

Multindexed columns are built

new_df = sproc.hier.flat_df_to_multiindexed_df(new_df)

Number of levels in the columns

new_df.columns.nlevels
6

Information about deleted entries

deleted_series = sproc.bundle.read_deleted_zip(zip_file)

Merge

Old and new data are stacked together

concatenated_df = sproc.assemble.stack(history_df, new_df)
concatenated_df
id summary title ContractFolderStatus updated ContractFolderStatus deleted_on
ContractFolderID LocatedContractingParty ProcurementProject ... LocatedContractingParty TenderResult LocatedContractingParty TenderingProcess LocatedContractingParty TenderingProcess LocatedContractingParty ContractFolderStatusCode
Party Name TypeCode BudgetAmount RequiredCommodityClassification ... Party AwardedTenderedProject ParentLocatedParty TenderSubmissionDeadlinePeriod ParentLocatedParty TenderSubmissionDeadlinePeriod BuyerProfileURIID
PartyName EstimatedOverallContractAmount TaxExclusiveAmount ItemClassificationCode ... PartyIdentification ProcurementProjectLotID ParentLocatedParty ParentLocatedParty Description
Name ... ID ParentLocatedParty ParentLocatedParty
... PartyName ParentLocatedParty
... Name PartyName
... Name
zip file name entry
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom 453 https://contrataciondelestado.es/sindicacion/P... Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 89917.95 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> [2018-01-02 08:01:52.024000+00:00] [RES] NaT
452 https://contrataciondelestado.es/sindicacion/P... Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 175708.46 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> [2018-01-02 08:02:24.833000+00:00] [RES] NaT
451 https://contrataciondelestado.es/sindicacion/P... Expediente: 1281/17, Entidad: Diputación Provi... Refuerzo de firme en la VP 4013 Melgar de Arri... 1281/17 Diputación Provincial de Valladolid Refuerzo de firme en la VP 4013 Melgar de Arri... 3.0 229259.52 229259.52 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> [2018-01-02 08:02:51.744000+00:00] [RES] NaT
448 https://contrataciondelestado.es/sindicacion/P... Id licitación: 2017017; Órgano de Contratació... Desarrollo del programa de intervención socioe... 2017017 Alcalde del Ayuntamiento de Eibar Desarrollo del programa de intervención socioe... 2.0 704145.00 361100.00 [85310000.0] ... <NA> [nan] <NA> 2017-10-13 19:00:00+00:00 <NA> <NA> <NA> [2018-01-02 09:25:52.396000+00:00] [RES] NaT
447 https://contrataciondelestado.es/sindicacion/P... Id licitación: B2017002; Órgano de Contrataci... STAND DE EUSKADI EN FITUR Y SUS POSIBLES ADAP... B2017002 Dirección general de BASQUETOUR Diseño, construcción en régimen de alquiler, t... 2.0 1150000.00 175000.00 [39154100.0] ... <NA> [nan] <NA> 2017-10-23 14:00:00+00:00 <NA> <NA> <NA> [2018-01-02 09:25:52.501000+00:00] [RES] NaT
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
PlataformasAgregadasSinMenores_202201_05-06.zip PlataformasAgregadasSinMenores_20220106_030013.atom 471 https://contrataciondelestado.es/sindicacion/P... Id licitación: 2021/57-12334; Órgano de Contra... L'objecte d'aquest contracte és la prestació c... 2021/57-12334 Ajuntament de Palau-solità i Plegamans L'objecte d'aquest contracte és la prestació c... 1.0 77695.74 77695.74 35120000 ... <NA> NaN <NA> 2022-01-19 15:00:00+00:00 <NA> <NA> https://contractaciopublica.gencat.cat/ecofin_... [2022-01-04 12:12:09.949000+00:00] [PUB] NaT
472 https://contrataciondelestado.es/sindicacion/P... Id licitación: 32/2021; Órgano de Contratación... Licitació per procediment obert harmonitzat i ... 32/2021 Consell Comarcal del Baix Llobregat Licitació per procediment obert harmonitzat i ... 2.0 16274569.09 6027618.18 60100000 ... <NA> [[1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2... <NA> 2021-05-07 15:00:00+00:00 <NA> <NA> https://contractaciopublica.gencat.cat/ecofin_... [2022-01-04 12:12:09.742000+00:00] [RES] NaT
473 https://contrataciondelestado.es/sindicacion/P... Id licitación: 1977/2021; Órgano de Contrataci... L'objecte d'aquest contracte és la prestació d... 1977/2021 Ajuntament de Sant Adrià de Besòs L'objecte d'aquest contracte és la prestació d... 2.0 76507.20 34776.00 50710000 ... <NA> [[1, 2]] <NA> 2021-06-30 23:59:00+00:00 <NA> <NA> https://contractaciopublica.gencat.cat/ecofin_... [2022-01-04 12:12:09.666000+00:00] [RES] NaT
474 https://contrataciondelestado.es/sindicacion/P... Id licitación: EXI-2022-7; Órgano de Contratac... Servei de trasllat de béns mobles( mobiliari, ... EXI-2022-7 Departament d'Acció Exterior i Govern Obert Servei de trasllat de béns mobles( mobiliari, ... 2.0 38111.04 15879.60 60161000 ... <NA> NaN <NA> 2021-12-13 13:00:00+00:00 <NA> <NA> https://contractaciopublica.gencat.cat/ecofin_... [2022-01-04 12:12:09.602000+00:00] [ADJ] NaT
475 https://contrataciondelestado.es/sindicacion/P... Id licitación: CONTR/2021/000000137; Órgano de... Servei de redacció del Projecte Executiu d'obr... CONTR/2021/000000137 Institut d'Assistència Sanitària (IAS) Servei de redacció del Projecte Executiu d'obr... 2.0 40374.94 33645.78 71000000 ... <NA> NaN <NA> 2021-11-10 18:00:00+00:00 <NA> <NA> https://contractaciopublica.gencat.cat/ecofin_... [2022-01-04 12:12:09.536000+00:00] [ADJ] NaT

971 rows × 43 columns

Only the last update is kept

concatenated_df = sproc.postprocess.keep_updates_only(concatenated_df)
concatenated_df.head(2)
[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
id summary title ContractFolderStatus deleted_on updated ContractFolderStatus
ContractFolderID LocatedContractingParty ProcurementProject ... LocatedContractingParty TenderResult LocatedContractingParty TenderingProcess LocatedContractingParty TenderingProcess LocatedContractingParty ContractFolderStatusCode
Party Name TypeCode BudgetAmount RequiredCommodityClassification ... Party AwardedTenderedProject ParentLocatedParty TenderSubmissionDeadlinePeriod ParentLocatedParty TenderSubmissionDeadlinePeriod BuyerProfileURIID
PartyName EstimatedOverallContractAmount TaxExclusiveAmount ItemClassificationCode ... PartyIdentification ProcurementProjectLotID ParentLocatedParty ParentLocatedParty Description
Name ... ID ParentLocatedParty ParentLocatedParty
... PartyName ParentLocatedParty
... Name PartyName
... Name
zip file name entry
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom 453 https://contrataciondelestado.es/sindicacion/P... Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 89917.95 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:01:52.024000+00:00] [RES]
452 https://contrataciondelestado.es/sindicacion/P... Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 175708.46 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:02:24.833000+00:00] [RES]

2 rows × 43 columns

In order to make it easy to update the deleted_on column in the existing pd.DataFrame using the data contained in the deleted pd.Series (i.e., to get automatic alignment) we - reindex the DataFrame using id

reindexed_concatenated_df = concatenated_df.reset_index().set_index(['id'])
reindexed_concatenated_df.head(2)
zip file name entry summary title ContractFolderStatus deleted_on updated ContractFolderStatus
ContractFolderID LocatedContractingParty ProcurementProject ... LocatedContractingParty TenderResult LocatedContractingParty TenderingProcess LocatedContractingParty TenderingProcess LocatedContractingParty ContractFolderStatusCode
Party Name TypeCode BudgetAmount ... Party AwardedTenderedProject ParentLocatedParty TenderSubmissionDeadlinePeriod ParentLocatedParty TenderSubmissionDeadlinePeriod BuyerProfileURIID
PartyName EstimatedOverallContractAmount ... PartyIdentification ProcurementProjectLotID ParentLocatedParty ParentLocatedParty Description
Name ... ID ParentLocatedParty ParentLocatedParty
... PartyName ParentLocatedParty
... Name PartyName
... Name
id
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1992428 PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137... 453 Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:01:52.024000+00:00] [RES]
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1992427 PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137... 452 Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:02:24.833000+00:00] [RES]

2 rows × 45 columns

  • drop the first level of the series (which leaves only id)
# reindexed_deleted_series = deleted_series.droplevel(0)
reindexed_deleted_series = deleted_series.droplevel([0, 1])
reindexed_deleted_series.head(2)
id
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/6724977   2022-01-04 00:12:01.376000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968315   2022-01-03 23:11:57.567000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]

Duplicates are removed from the above pd.Series

deduplicated_reindexed_deleted_series = sproc.postprocess.deduplicate_deleted_series(reindexed_deleted_series)
deduplicated_reindexed_deleted_series
id
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1957921   2022-01-03 23:11:56.535000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1959074   2022-01-03 23:11:56.497000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968266   2022-01-04 23:12:11.916000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968267   2022-01-04 23:12:12.775000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968268   2022-01-04 23:12:12.731000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968269   2022-01-04 23:12:13.013000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968271   2022-01-04 23:12:12.172000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968272   2022-01-04 23:12:12.957000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968279   2022-01-04 23:12:12.039000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968280   2022-01-04 23:12:12.865000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968282   2022-01-04 23:12:13.107000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968284   2022-01-04 23:12:12.911000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968285   2022-01-04 23:12:12.819000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968286   2022-01-04 23:12:13.147000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968287   2022-01-04 23:12:11.998000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968288   2022-01-04 23:12:11.957000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968289   2022-01-04 23:12:12.129000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968290   2022-01-04 23:12:12.635000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968291   2022-01-04 23:12:13.191000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968294   2022-01-04 23:12:13.328000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968299   2022-01-04 23:12:13.236000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968301   2022-01-04 23:12:13.283000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968302   2022-01-04 23:12:12.680000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968303   2022-01-04 23:12:13.058000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968305   2022-01-04 23:12:12.076000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968315   2022-01-03 23:11:57.567000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968317   2022-01-03 23:11:56.751000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968318   2022-01-03 23:11:57.038000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968319   2022-01-03 23:11:56.620000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968321   2022-01-03 23:11:57.323000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968324   2022-01-03 23:11:57.471000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968325   2022-01-03 23:11:56.579000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968327   2022-01-03 23:11:57.089000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968329   2022-01-03 23:11:56.850000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968331   2022-01-03 23:11:56.897000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968332   2022-01-03 23:11:56.949000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968333   2022-01-03 23:11:57.183000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968335   2022-01-03 23:11:57.275000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968337   2022-01-03 23:11:57.375000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968338   2022-01-03 23:11:56.805000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968343   2022-01-03 23:11:57.232000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968344   2022-01-03 23:11:57.426000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968345   2022-01-03 23:11:57.516000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968346   2022-01-03 23:11:56.661000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968358   2022-01-03 23:11:56.706000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968360   2022-01-03 23:11:57.141000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968361   2022-01-03 23:11:56.991000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/6724977   2022-01-04 00:12:01.376000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8892704   2022-01-04 13:13:12.670000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]

The old and new sizes

reindexed_deleted_series.shape, deduplicated_reindexed_deleted_series.shape
((49,), (49,))

No keys are lost along the way

assert (set(reindexed_deleted_series.index) - set(deduplicated_reindexed_deleted_series.index)) == set()

Duplicates in the old pd.Series

reindexed_deleted_series.loc[reindexed_deleted_series.index.duplicated()]
Series([], Name: deleted_on, dtype: datetime64[ns, UTC])

There should be none in the new

assert deduplicated_reindexed_deleted_series.loc[deduplicated_reindexed_deleted_series.index.duplicated()].empty

What is the intersection between the entries in the new pd.DataFrame and the pd.Series with information about deletes (only the first elements).

list(itertools.islice(set(reindexed_concatenated_df.index) & set(deduplicated_reindexed_deleted_series.index), 0, 4))
[]

A deleted entry, identified by its id, and also showing the source ATOM file

deleted_entry = deduplicated_reindexed_deleted_series.index[0]
deleted_entry
'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1957921'

It’s not actually present in the data

(new_df['id'] == deleted_entry).sum()
0

In order to see if something changes when updating column deleted_on, a backup of the old pd.DataFrame is made

# reindexed_concatenated_df = bak_reindexed_concatenated_df.copy()
bak_reindexed_concatenated_df = reindexed_concatenated_df.copy()

The columns deleted_on in the pd.DataFrame is updated for the indexes present in the deleted series. Notice that we cannot do

reindexed_concatenated_df['deleted_on'] = deduplicated_reindexed_deleted_series

because that would effectively wipe out the values of all the entries not present in the series.

na’s in the existing pd.DataFrame are filled in with data from the deleted series

reindexed_concatenated_df['deleted_on'] = reindexed_concatenated_df['deleted_on'].fillna(deduplicated_reindexed_deleted_series)

The number of non-NAs in the old and new pd.DataFrames

n_notnas_old, n_notnas_new = bak_reindexed_concatenated_df['deleted_on'].notna().sum(), reindexed_concatenated_df['deleted_on'].notna().sum()
n_notnas_old, n_notnas_new
(0, 0)

The latter be larger than or equal than the former

assert n_notnas_new >= n_notnas_old

The original index is reset (notice the reset_index() in the middle that avoids losing id columns, which is right now the index)

reindexed_concatenated_df = reindexed_concatenated_df.reset_index().set_index(['zip', 'file name', 'entry'])

Finally

reindexed_concatenated_df.head(2)
id summary title ContractFolderStatus deleted_on updated ContractFolderStatus
ContractFolderID LocatedContractingParty ProcurementProject ... LocatedContractingParty TenderResult LocatedContractingParty TenderingProcess LocatedContractingParty TenderingProcess LocatedContractingParty ContractFolderStatusCode
Party Name TypeCode BudgetAmount RequiredCommodityClassification ... Party AwardedTenderedProject ParentLocatedParty TenderSubmissionDeadlinePeriod ParentLocatedParty TenderSubmissionDeadlinePeriod BuyerProfileURIID
PartyName EstimatedOverallContractAmount TaxExclusiveAmount ItemClassificationCode ... PartyIdentification ProcurementProjectLotID ParentLocatedParty ParentLocatedParty Description
Name ... ID ParentLocatedParty ParentLocatedParty
... PartyName ParentLocatedParty
... Name PartyName
... Name
zip file name entry
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom 453 https://contrataciondelestado.es/sindicacion/P... Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 89917.95 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:01:52.024000+00:00] [RES]
452 https://contrataciondelestado.es/sindicacion/P... Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 175708.46 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:02:24.833000+00:00] [RES]

2 rows × 43 columns

A function encompassing all of the above steps in one go


source

df_with_zip

 df_with_zip (history_df:pandas.core.frame.DataFrame,
              zip_file:str|pathlib.Path,
              output_file:str|pathlib.Path|None=None)

Extend an existing DataFrame with the data in a zip file

Type Default Details
history_df DataFrame DataFrame to be extended
zip_file str | pathlib.Path Zip file with new data
output_file str | pathlib.Path | None None Output file (optional)
Returns None | pandas.core.frame.DataFrame Extended DataFrame or nothing if an output_file was passed

For historical reasons, a slightly higher-level function is included


source

parquet_with_zip

 parquet_with_zip (history_file:str|pathlib.Path,
                   zip_file:str|pathlib.Path,
                   output_file:str|pathlib.Path|None=None)

Extend an existing parquet file with the data in a zip file

Type Default Details
history_file str | pathlib.Path DataFrame to be extended
zip_file str | pathlib.Path Zip file with new data
output_file str | pathlib.Path | None None Output file (optional)
Returns None | pandas.core.frame.DataFrame Extended DataFrame or nothing if an output_file was passed
merged_df = parquet_with_zip(history_input_file, zip_file)
merged_df.head(2)
[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
id summary title ContractFolderStatus deleted_on updated ContractFolderStatus
ContractFolderID LocatedContractingParty ProcurementProject ... LocatedContractingParty TenderResult LocatedContractingParty TenderingProcess LocatedContractingParty TenderingProcess LocatedContractingParty ContractFolderStatusCode
Party Name TypeCode BudgetAmount RequiredCommodityClassification ... Party AwardedTenderedProject ParentLocatedParty TenderSubmissionDeadlinePeriod ParentLocatedParty TenderSubmissionDeadlinePeriod BuyerProfileURIID
PartyName EstimatedOverallContractAmount TaxExclusiveAmount ItemClassificationCode ... PartyIdentification ProcurementProjectLotID ParentLocatedParty ParentLocatedParty Description
Name ... ID ParentLocatedParty ParentLocatedParty
... PartyName ParentLocatedParty
... Name PartyName
... Name
zip file name entry
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom 453 https://contrataciondelestado.es/sindicacion/P... Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 89917.95 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:01:52.024000+00:00] [RES]
452 https://contrataciondelestado.es/sindicacion/P... Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 175708.46 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> NaT [2018-01-02 08:02:24.833000+00:00] [RES]

2 rows × 43 columns

We should be getting exactly the same result as above

assert reindexed_concatenated_df.equals(merged_df)
output_file = history_directory / 'merged.parquet'
parquet_with_zip(history_input_file, zip_file, output_file)
[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
sample_output_df = pd.read_parquet(output_file)
sample_output_df.head(2)
id summary title ContractFolderStatus deleted_on updated ContractFolderStatus
ContractFolderID LocatedContractingParty ProcurementProject ... LocatedContractingParty TenderResult LocatedContractingParty TenderingProcess LocatedContractingParty TenderingProcess LocatedContractingParty ContractFolderStatusCode
Party Name TypeCode BudgetAmount RequiredCommodityClassification ... Party AwardedTenderedProject ParentLocatedParty TenderSubmissionDeadlinePeriod ParentLocatedParty TenderSubmissionDeadlinePeriod BuyerProfileURIID
PartyName EstimatedOverallContractAmount TaxExclusiveAmount ItemClassificationCode ... PartyIdentification ProcurementProjectLotID ParentLocatedParty ParentLocatedParty Description
Name ... ID ParentLocatedParty ParentLocatedParty
... PartyName ParentLocatedParty
... Name PartyName
... Name
zip file name entry
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom 453 https://contrataciondelestado.es/sindicacion/P... Expediente: 1284/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 3001 Renedo de Esgu... 1284/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 3001 Renedo de Esgu... 3.0 89917.95 89917.95 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> None [2018-01-02 08:01:52.024000+00:00] [RES]
452 https://contrataciondelestado.es/sindicacion/P... Expediente: 1282/17, Entidad: Diputación Provi... Refuerzo de Firme en la VP 6603 Mota del Marqu... 1282/17 Diputación Provincial de Valladolid Refuerzo de Firme en la VP 6603 Mota del Marqu... 3.0 175708.46 175708.46 [45233142.0] ... L02000047 [1.0] <NA> 2017-11-02 23:59:00+00:00 <NA> <NA> <NA> None [2018-01-02 08:02:24.833000+00:00] [RES]

2 rows × 43 columns

Let us make sure nothing was lost

assert sample_output_df.shape == merged_df.shape