import itertoolsextend
pd.DataFrame with new data
These imports are not actually needed by the final exported code
Historical data
Directory from which to read historical data
history_directory = pathlib.Path.cwd().parent / 'samples'
assert history_directory.exists()
print(history_directory)/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples
Input directory for the pd.DataFrame to be updated
history_input_file = history_directory / '2018-2021_20samples.parquet'
assert history_input_file.exists()
print(history_input_file)/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/2018-2021_20samples.parquet
The specific file is
history_df = pd.read_parquet(history_input_file)
history_df.head(2)| id | summary | title | ContractFolderStatus | updated | ContractFolderStatus | deleted_on | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
| Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
| PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
| Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
| ... | PartyName | ParentLocatedParty | |||||||||||||||||||||
| ... | Name | PartyName | |||||||||||||||||||||
| ... | Name | ||||||||||||||||||||||
| zip | file name | entry | |||||||||||||||||||||
| PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:01:52.024000+00:00] | [RES] | NaT |
| 452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:02:24.833000+00:00] | [RES] | NaT | ||
2 rows × 43 columns
New data
The directory for the zip file whose data is to appended to the already existing pd.DataFrame
zip_directory = history_directory
# zip_directory = pathlib.Path.cwd() / 'data' / 'agregados'
assert zip_directory.exists()
print(zip_directory)/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples
The particular file
zip_file = zip_directory / 'PlataformasAgregadasSinMenores_202201_05-06.zip'
assert zip_file.exists()
print(zip_file)/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_202201_05-06.zip
The file is read and processed in the usual way to get a pd.DataFrame
new_df = sproc.bundle.read_zip(zip_file, concatenate=True)Multindexed columns are built
new_df = sproc.hier.flat_df_to_multiindexed_df(new_df)Number of levels in the columns
new_df.columns.nlevels6
Information about deleted entries
deleted_series = sproc.bundle.read_deleted_zip(zip_file)Merge
Old and new data are stacked together
concatenated_df = sproc.assemble.stack(history_df, new_df)
concatenated_df| id | summary | title | ContractFolderStatus | updated | ContractFolderStatus | deleted_on | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
| Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
| PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
| Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
| ... | PartyName | ParentLocatedParty | |||||||||||||||||||||
| ... | Name | PartyName | |||||||||||||||||||||
| ... | Name | ||||||||||||||||||||||
| zip | file name | entry | |||||||||||||||||||||
| PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:01:52.024000+00:00] | [RES] | NaT |
| 452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:02:24.833000+00:00] | [RES] | NaT | ||
| 451 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1281/17, Entidad: Diputación Provi... | Refuerzo de firme en la VP 4013 Melgar de Arri... | 1281/17 | Diputación Provincial de Valladolid | Refuerzo de firme en la VP 4013 Melgar de Arri... | 3.0 | 229259.52 | 229259.52 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:02:51.744000+00:00] | [RES] | NaT | ||
| 448 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2017017; Órgano de Contratació... | Desarrollo del programa de intervención socioe... | 2017017 | Alcalde del Ayuntamiento de Eibar | Desarrollo del programa de intervención socioe... | 2.0 | 704145.00 | 361100.00 | [85310000.0] | ... | <NA> | [nan] | <NA> | 2017-10-13 19:00:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 09:25:52.396000+00:00] | [RES] | NaT | ||
| 447 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: B2017002; Órgano de Contrataci... | STAND DE EUSKADI EN FITUR Y SUS POSIBLES ADAP... | B2017002 | Dirección general de BASQUETOUR | Diseño, construcción en régimen de alquiler, t... | 2.0 | 1150000.00 | 175000.00 | [39154100.0] | ... | <NA> | [nan] | <NA> | 2017-10-23 14:00:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 09:25:52.501000+00:00] | [RES] | NaT | ||
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PlataformasAgregadasSinMenores_202201_05-06.zip | PlataformasAgregadasSinMenores_20220106_030013.atom | 471 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2021/57-12334; Órgano de Contra... | L'objecte d'aquest contracte és la prestació c... | 2021/57-12334 | Ajuntament de Palau-solità i Plegamans | L'objecte d'aquest contracte és la prestació c... | 1.0 | 77695.74 | 77695.74 | 35120000 | ... | <NA> | NaN | <NA> | 2022-01-19 15:00:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.949000+00:00] | [PUB] | NaT |
| 472 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 32/2021; Órgano de Contratación... | Licitació per procediment obert harmonitzat i ... | 32/2021 | Consell Comarcal del Baix Llobregat | Licitació per procediment obert harmonitzat i ... | 2.0 | 16274569.09 | 6027618.18 | 60100000 | ... | <NA> | [[1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2... | <NA> | 2021-05-07 15:00:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.742000+00:00] | [RES] | NaT | ||
| 473 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1977/2021; Órgano de Contrataci... | L'objecte d'aquest contracte és la prestació d... | 1977/2021 | Ajuntament de Sant Adrià de Besòs | L'objecte d'aquest contracte és la prestació d... | 2.0 | 76507.20 | 34776.00 | 50710000 | ... | <NA> | [[1, 2]] | <NA> | 2021-06-30 23:59:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.666000+00:00] | [RES] | NaT | ||
| 474 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: EXI-2022-7; Órgano de Contratac... | Servei de trasllat de béns mobles( mobiliari, ... | EXI-2022-7 | Departament d'Acció Exterior i Govern Obert | Servei de trasllat de béns mobles( mobiliari, ... | 2.0 | 38111.04 | 15879.60 | 60161000 | ... | <NA> | NaN | <NA> | 2021-12-13 13:00:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.602000+00:00] | [ADJ] | NaT | ||
| 475 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: CONTR/2021/000000137; Órgano de... | Servei de redacció del Projecte Executiu d'obr... | CONTR/2021/000000137 | Institut d'Assistència Sanitària (IAS) | Servei de redacció del Projecte Executiu d'obr... | 2.0 | 40374.94 | 33645.78 | 71000000 | ... | <NA> | NaN | <NA> | 2021-11-10 18:00:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.536000+00:00] | [ADJ] | NaT | ||
971 rows × 43 columns
Only the last update is kept
concatenated_df = sproc.postprocess.keep_updates_only(concatenated_df)
concatenated_df.head(2)[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
| id | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
| Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
| PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
| Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
| ... | PartyName | ParentLocatedParty | |||||||||||||||||||||
| ... | Name | PartyName | |||||||||||||||||||||
| ... | Name | ||||||||||||||||||||||
| zip | file name | entry | |||||||||||||||||||||
| PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
| 452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] | ||
2 rows × 43 columns
In order to make it easy to update the deleted_on column in the existing pd.DataFrame using the data contained in the deleted pd.Series (i.e., to get automatic alignment) we - reindex the DataFrame using id
reindexed_concatenated_df = concatenated_df.reset_index().set_index(['id'])
reindexed_concatenated_df.head(2)| zip | file name | entry | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||
| Party | Name | TypeCode | BudgetAmount | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | ||||||||||
| PartyName | EstimatedOverallContractAmount | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
| Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||
| ... | PartyName | ParentLocatedParty | |||||||||||||||||||
| ... | Name | PartyName | |||||||||||||||||||
| ... | Name | ||||||||||||||||||||
| id | |||||||||||||||||||||
| https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1992428 | PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137... | 453 | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
| https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1992427 | PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137... | 452 | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] |
2 rows × 45 columns
- drop the first level of the series (which leaves only
id)
# reindexed_deleted_series = deleted_series.droplevel(0)
reindexed_deleted_series = deleted_series.droplevel([0, 1])
reindexed_deleted_series.head(2)id
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/6724977 2022-01-04 00:12:01.376000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968315 2022-01-03 23:11:57.567000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
Duplicates are removed from the above pd.Series
deduplicated_reindexed_deleted_series = sproc.postprocess.deduplicate_deleted_series(reindexed_deleted_series)
deduplicated_reindexed_deleted_seriesid
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1957921 2022-01-03 23:11:56.535000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1959074 2022-01-03 23:11:56.497000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968266 2022-01-04 23:12:11.916000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968267 2022-01-04 23:12:12.775000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968268 2022-01-04 23:12:12.731000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968269 2022-01-04 23:12:13.013000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968271 2022-01-04 23:12:12.172000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968272 2022-01-04 23:12:12.957000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968279 2022-01-04 23:12:12.039000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968280 2022-01-04 23:12:12.865000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968282 2022-01-04 23:12:13.107000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968284 2022-01-04 23:12:12.911000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968285 2022-01-04 23:12:12.819000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968286 2022-01-04 23:12:13.147000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968287 2022-01-04 23:12:11.998000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968288 2022-01-04 23:12:11.957000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968289 2022-01-04 23:12:12.129000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968290 2022-01-04 23:12:12.635000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968291 2022-01-04 23:12:13.191000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968294 2022-01-04 23:12:13.328000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968299 2022-01-04 23:12:13.236000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968301 2022-01-04 23:12:13.283000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968302 2022-01-04 23:12:12.680000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968303 2022-01-04 23:12:13.058000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968305 2022-01-04 23:12:12.076000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968315 2022-01-03 23:11:57.567000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968317 2022-01-03 23:11:56.751000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968318 2022-01-03 23:11:57.038000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968319 2022-01-03 23:11:56.620000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968321 2022-01-03 23:11:57.323000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968324 2022-01-03 23:11:57.471000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968325 2022-01-03 23:11:56.579000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968327 2022-01-03 23:11:57.089000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968329 2022-01-03 23:11:56.850000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968331 2022-01-03 23:11:56.897000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968332 2022-01-03 23:11:56.949000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968333 2022-01-03 23:11:57.183000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968335 2022-01-03 23:11:57.275000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968337 2022-01-03 23:11:57.375000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968338 2022-01-03 23:11:56.805000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968343 2022-01-03 23:11:57.232000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968344 2022-01-03 23:11:57.426000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968345 2022-01-03 23:11:57.516000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968346 2022-01-03 23:11:56.661000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968358 2022-01-03 23:11:56.706000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968360 2022-01-03 23:11:57.141000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968361 2022-01-03 23:11:56.991000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/6724977 2022-01-04 00:12:01.376000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8892704 2022-01-04 13:13:12.670000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
The old and new sizes
reindexed_deleted_series.shape, deduplicated_reindexed_deleted_series.shape((49,), (49,))
No keys are lost along the way
assert (set(reindexed_deleted_series.index) - set(deduplicated_reindexed_deleted_series.index)) == set()Duplicates in the old pd.Series
reindexed_deleted_series.loc[reindexed_deleted_series.index.duplicated()]Series([], Name: deleted_on, dtype: datetime64[ns, UTC])
There should be none in the new
assert deduplicated_reindexed_deleted_series.loc[deduplicated_reindexed_deleted_series.index.duplicated()].emptyWhat is the intersection between the entries in the new pd.DataFrame and the pd.Series with information about deletes (only the first elements).
list(itertools.islice(set(reindexed_concatenated_df.index) & set(deduplicated_reindexed_deleted_series.index), 0, 4))[]
A deleted entry, identified by its id, and also showing the source ATOM file
deleted_entry = deduplicated_reindexed_deleted_series.index[0]
deleted_entry'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1957921'
It’s not actually present in the data
(new_df['id'] == deleted_entry).sum()0
In order to see if something changes when updating column deleted_on, a backup of the old pd.DataFrame is made
# reindexed_concatenated_df = bak_reindexed_concatenated_df.copy()
bak_reindexed_concatenated_df = reindexed_concatenated_df.copy()The columns deleted_on in the pd.DataFrame is updated for the indexes present in the deleted series. Notice that we cannot do
reindexed_concatenated_df['deleted_on'] = deduplicated_reindexed_deleted_series
because that would effectively wipe out the values of all the entries not present in the series.
na’s in the existing pd.DataFrame are filled in with data from the deleted series
reindexed_concatenated_df['deleted_on'] = reindexed_concatenated_df['deleted_on'].fillna(deduplicated_reindexed_deleted_series)The number of non-NAs in the old and new pd.DataFrames
n_notnas_old, n_notnas_new = bak_reindexed_concatenated_df['deleted_on'].notna().sum(), reindexed_concatenated_df['deleted_on'].notna().sum()
n_notnas_old, n_notnas_new(0, 0)
The latter be larger than or equal than the former
assert n_notnas_new >= n_notnas_oldThe original index is reset (notice the reset_index() in the middle that avoids losing id columns, which is right now the index)
reindexed_concatenated_df = reindexed_concatenated_df.reset_index().set_index(['zip', 'file name', 'entry'])Finally
reindexed_concatenated_df.head(2)| id | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
| Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
| PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
| Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
| ... | PartyName | ParentLocatedParty | |||||||||||||||||||||
| ... | Name | PartyName | |||||||||||||||||||||
| ... | Name | ||||||||||||||||||||||
| zip | file name | entry | |||||||||||||||||||||
| PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
| 452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] | ||
2 rows × 43 columns
A function encompassing all of the above steps in one go
df_with_zip
df_with_zip (history_df:pandas.core.frame.DataFrame, zip_file:str|pathlib.Path, output_file:str|pathlib.Path|None=None)
Extend an existing DataFrame with the data in a zip file
| Type | Default | Details | |
|---|---|---|---|
| history_df | DataFrame | DataFrame to be extended | |
| zip_file | str | pathlib.Path | Zip file with new data | |
| output_file | str | pathlib.Path | None | None | Output file (optional) |
| Returns | None | pandas.core.frame.DataFrame | Extended DataFrame or nothing if an output_file was passed |
For historical reasons, a slightly higher-level function is included
parquet_with_zip
parquet_with_zip (history_file:str|pathlib.Path, zip_file:str|pathlib.Path, output_file:str|pathlib.Path|None=None)
Extend an existing parquet file with the data in a zip file
| Type | Default | Details | |
|---|---|---|---|
| history_file | str | pathlib.Path | DataFrame to be extended | |
| zip_file | str | pathlib.Path | Zip file with new data | |
| output_file | str | pathlib.Path | None | None | Output file (optional) |
| Returns | None | pandas.core.frame.DataFrame | Extended DataFrame or nothing if an output_file was passed |
merged_df = parquet_with_zip(history_input_file, zip_file)
merged_df.head(2)[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
| id | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
| Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
| PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
| Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
| ... | PartyName | ParentLocatedParty | |||||||||||||||||||||
| ... | Name | PartyName | |||||||||||||||||||||
| ... | Name | ||||||||||||||||||||||
| zip | file name | entry | |||||||||||||||||||||
| PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
| 452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] | ||
2 rows × 43 columns
We should be getting exactly the same result as above
assert reindexed_concatenated_df.equals(merged_df)output_file = history_directory / 'merged.parquet'
parquet_with_zip(history_input_file, zip_file, output_file)[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
sample_output_df = pd.read_parquet(output_file)
sample_output_df.head(2)| id | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
| Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
| PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
| Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
| ... | PartyName | ParentLocatedParty | |||||||||||||||||||||
| ... | Name | PartyName | |||||||||||||||||||||
| ... | Name | ||||||||||||||||||||||
| zip | file name | entry | |||||||||||||||||||||
| PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | None | [2018-01-02 08:01:52.024000+00:00] | [RES] |
| 452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | None | [2018-01-02 08:02:24.833000+00:00] | [RES] | ||
2 rows × 43 columns
Let us make sure nothing was lost
assert sample_output_df.shape == merged_df.shape