import itertools
extend
pd.DataFrame
with new data
These import
s are not actually needed by the final exported code
Historical data
Directory from which to read historical data
= pathlib.Path.cwd().parent / 'samples'
history_directory assert history_directory.exists()
print(history_directory)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples
Input directory for the pd.DataFrame
to be updated
= history_directory / '2018-2021_20samples.parquet'
history_input_file assert history_input_file.exists()
print(history_input_file)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/2018-2021_20samples.parquet
The specific file is
= pd.read_parquet(history_input_file)
history_df 2) history_df.head(
id | summary | title | ContractFolderStatus | updated | ContractFolderStatus | deleted_on | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||||
... | Name | ||||||||||||||||||||||
zip | file name | entry | |||||||||||||||||||||
PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:01:52.024000+00:00] | [RES] | NaT |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:02:24.833000+00:00] | [RES] | NaT |
2 rows × 43 columns
New data
The directory for the zip file whose data is to appended to the already existing pd.DataFrame
= history_directory
zip_directory # zip_directory = pathlib.Path.cwd() / 'data' / 'agregados'
assert zip_directory.exists()
print(zip_directory)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples
The particular file
= zip_directory / 'PlataformasAgregadasSinMenores_202201_05-06.zip'
zip_file assert zip_file.exists()
print(zip_file)
/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_202201_05-06.zip
The file is read and processed in the usual way to get a pd.DataFrame
= sproc.bundle.read_zip(zip_file, concatenate=True) new_df
Multindexed columns are built
= sproc.hier.flat_df_to_multiindexed_df(new_df) new_df
Number of levels in the columns
new_df.columns.nlevels
6
Information about deleted entries
= sproc.bundle.read_deleted_zip(zip_file) deleted_series
Merge
Old and new data are stacked together
= sproc.assemble.stack(history_df, new_df)
concatenated_df concatenated_df
id | summary | title | ContractFolderStatus | updated | ContractFolderStatus | deleted_on | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||||
... | Name | ||||||||||||||||||||||
zip | file name | entry | |||||||||||||||||||||
PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:01:52.024000+00:00] | [RES] | NaT |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:02:24.833000+00:00] | [RES] | NaT | ||
451 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1281/17, Entidad: Diputación Provi... | Refuerzo de firme en la VP 4013 Melgar de Arri... | 1281/17 | Diputación Provincial de Valladolid | Refuerzo de firme en la VP 4013 Melgar de Arri... | 3.0 | 229259.52 | 229259.52 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 08:02:51.744000+00:00] | [RES] | NaT | ||
448 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2017017; Órgano de Contratació... | Desarrollo del programa de intervención socioe... | 2017017 | Alcalde del Ayuntamiento de Eibar | Desarrollo del programa de intervención socioe... | 2.0 | 704145.00 | 361100.00 | [85310000.0] | ... | <NA> | [nan] | <NA> | 2017-10-13 19:00:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 09:25:52.396000+00:00] | [RES] | NaT | ||
447 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: B2017002; Órgano de Contrataci... | STAND DE EUSKADI EN FITUR Y SUS POSIBLES ADAP... | B2017002 | Dirección general de BASQUETOUR | Diseño, construcción en régimen de alquiler, t... | 2.0 | 1150000.00 | 175000.00 | [39154100.0] | ... | <NA> | [nan] | <NA> | 2017-10-23 14:00:00+00:00 | <NA> | <NA> | <NA> | [2018-01-02 09:25:52.501000+00:00] | [RES] | NaT | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
PlataformasAgregadasSinMenores_202201_05-06.zip | PlataformasAgregadasSinMenores_20220106_030013.atom | 471 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2021/57-12334; Órgano de Contra... | L'objecte d'aquest contracte és la prestació c... | 2021/57-12334 | Ajuntament de Palau-solità i Plegamans | L'objecte d'aquest contracte és la prestació c... | 1.0 | 77695.74 | 77695.74 | 35120000 | ... | <NA> | NaN | <NA> | 2022-01-19 15:00:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.949000+00:00] | [PUB] | NaT |
472 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 32/2021; Órgano de Contratación... | Licitació per procediment obert harmonitzat i ... | 32/2021 | Consell Comarcal del Baix Llobregat | Licitació per procediment obert harmonitzat i ... | 2.0 | 16274569.09 | 6027618.18 | 60100000 | ... | <NA> | [[1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2... | <NA> | 2021-05-07 15:00:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.742000+00:00] | [RES] | NaT | ||
473 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1977/2021; Órgano de Contrataci... | L'objecte d'aquest contracte és la prestació d... | 1977/2021 | Ajuntament de Sant Adrià de Besòs | L'objecte d'aquest contracte és la prestació d... | 2.0 | 76507.20 | 34776.00 | 50710000 | ... | <NA> | [[1, 2]] | <NA> | 2021-06-30 23:59:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.666000+00:00] | [RES] | NaT | ||
474 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: EXI-2022-7; Órgano de Contratac... | Servei de trasllat de béns mobles( mobiliari, ... | EXI-2022-7 | Departament d'Acció Exterior i Govern Obert | Servei de trasllat de béns mobles( mobiliari, ... | 2.0 | 38111.04 | 15879.60 | 60161000 | ... | <NA> | NaN | <NA> | 2021-12-13 13:00:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.602000+00:00] | [ADJ] | NaT | ||
475 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: CONTR/2021/000000137; Órgano de... | Servei de redacció del Projecte Executiu d'obr... | CONTR/2021/000000137 | Institut d'Assistència Sanitària (IAS) | Servei de redacció del Projecte Executiu d'obr... | 2.0 | 40374.94 | 33645.78 | 71000000 | ... | <NA> | NaN | <NA> | 2021-11-10 18:00:00+00:00 | <NA> | <NA> | https://contractaciopublica.gencat.cat/ecofin_... | [2022-01-04 12:12:09.536000+00:00] | [ADJ] | NaT |
971 rows × 43 columns
Only the last update is kept
= sproc.postprocess.keep_updates_only(concatenated_df)
concatenated_df 2) concatenated_df.head(
[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
id | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||||
... | Name | ||||||||||||||||||||||
zip | file name | entry | |||||||||||||||||||||
PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] |
2 rows × 43 columns
In order to make it easy to update the deleted_on
column in the existing pd.DataFrame
using the data contained in the deleted pd.Series
(i.e., to get automatic alignment) we - reindex the DataFrame
using id
= concatenated_df.reset_index().set_index(['id'])
reindexed_concatenated_df 2) reindexed_concatenated_df.head(
zip | file name | entry | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||
Party | Name | TypeCode | BudgetAmount | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | ||||||||||
PartyName | EstimatedOverallContractAmount | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||
... | Name | ||||||||||||||||||||
id | |||||||||||||||||||||
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1992428 | PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137... | 453 | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1992427 | PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137... | 452 | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] |
2 rows × 45 columns
- drop the first level of the series (which leaves only
id
)
# reindexed_deleted_series = deleted_series.droplevel(0)
= deleted_series.droplevel([0, 1])
reindexed_deleted_series 2) reindexed_deleted_series.head(
id
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/6724977 2022-01-04 00:12:01.376000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968315 2022-01-03 23:11:57.567000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
Duplicates are removed from the above pd.Series
= sproc.postprocess.deduplicate_deleted_series(reindexed_deleted_series)
deduplicated_reindexed_deleted_series deduplicated_reindexed_deleted_series
id
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1957921 2022-01-03 23:11:56.535000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1959074 2022-01-03 23:11:56.497000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968266 2022-01-04 23:12:11.916000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968267 2022-01-04 23:12:12.775000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968268 2022-01-04 23:12:12.731000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968269 2022-01-04 23:12:13.013000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968271 2022-01-04 23:12:12.172000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968272 2022-01-04 23:12:12.957000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968279 2022-01-04 23:12:12.039000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968280 2022-01-04 23:12:12.865000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968282 2022-01-04 23:12:13.107000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968284 2022-01-04 23:12:12.911000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968285 2022-01-04 23:12:12.819000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968286 2022-01-04 23:12:13.147000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968287 2022-01-04 23:12:11.998000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968288 2022-01-04 23:12:11.957000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968289 2022-01-04 23:12:12.129000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968290 2022-01-04 23:12:12.635000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968291 2022-01-04 23:12:13.191000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968294 2022-01-04 23:12:13.328000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968299 2022-01-04 23:12:13.236000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968301 2022-01-04 23:12:13.283000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968302 2022-01-04 23:12:12.680000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968303 2022-01-04 23:12:13.058000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968305 2022-01-04 23:12:12.076000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968315 2022-01-03 23:11:57.567000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968317 2022-01-03 23:11:56.751000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968318 2022-01-03 23:11:57.038000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968319 2022-01-03 23:11:56.620000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968321 2022-01-03 23:11:57.323000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968324 2022-01-03 23:11:57.471000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968325 2022-01-03 23:11:56.579000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968327 2022-01-03 23:11:57.089000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968329 2022-01-03 23:11:56.850000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968331 2022-01-03 23:11:56.897000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968332 2022-01-03 23:11:56.949000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968333 2022-01-03 23:11:57.183000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968335 2022-01-03 23:11:57.275000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968337 2022-01-03 23:11:57.375000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968338 2022-01-03 23:11:56.805000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968343 2022-01-03 23:11:57.232000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968344 2022-01-03 23:11:57.426000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968345 2022-01-03 23:11:57.516000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968346 2022-01-03 23:11:56.661000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968358 2022-01-03 23:11:56.706000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968360 2022-01-03 23:11:57.141000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1968361 2022-01-03 23:11:56.991000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/6724977 2022-01-04 00:12:01.376000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8892704 2022-01-04 13:13:12.670000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
The old and new sizes
reindexed_deleted_series.shape, deduplicated_reindexed_deleted_series.shape
((49,), (49,))
No keys are lost along the way
assert (set(reindexed_deleted_series.index) - set(deduplicated_reindexed_deleted_series.index)) == set()
Duplicates in the old pd.Series
reindexed_deleted_series.loc[reindexed_deleted_series.index.duplicated()]
Series([], Name: deleted_on, dtype: datetime64[ns, UTC])
There should be none in the new
assert deduplicated_reindexed_deleted_series.loc[deduplicated_reindexed_deleted_series.index.duplicated()].empty
What is the intersection between the entries in the new pd.DataFrame
and the pd.Series
with information about deletes (only the first elements).
list(itertools.islice(set(reindexed_concatenated_df.index) & set(deduplicated_reindexed_deleted_series.index), 0, 4))
[]
A deleted entry, identified by its id
, and also showing the source ATOM file
= deduplicated_reindexed_deleted_series.index[0]
deleted_entry deleted_entry
'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1957921'
It’s not actually present in the data
'id'] == deleted_entry).sum() (new_df[
0
In order to see if something changes when updating column deleted_on
, a backup of the old pd.DataFrame
is made
# reindexed_concatenated_df = bak_reindexed_concatenated_df.copy()
= reindexed_concatenated_df.copy() bak_reindexed_concatenated_df
The columns deleted_on
in the pd.DataFrame
is updated for the indexes present in the deleted series. Notice that we cannot do
reindexed_concatenated_df['deleted_on'] = deduplicated_reindexed_deleted_series
because that would effectively wipe out the values of all the entries not present in the series.
na’s in the existing pd.DataFrame
are filled in with data from the deleted series
'deleted_on'] = reindexed_concatenated_df['deleted_on'].fillna(deduplicated_reindexed_deleted_series) reindexed_concatenated_df[
The number of non-NAs in the old and new pd.DataFrame
s
= bak_reindexed_concatenated_df['deleted_on'].notna().sum(), reindexed_concatenated_df['deleted_on'].notna().sum()
n_notnas_old, n_notnas_new n_notnas_old, n_notnas_new
(0, 0)
The latter be larger than or equal than the former
assert n_notnas_new >= n_notnas_old
The original index is reset (notice the reset_index()
in the middle that avoids losing id
columns, which is right now the index)
= reindexed_concatenated_df.reset_index().set_index(['zip', 'file name', 'entry']) reindexed_concatenated_df
Finally
2) reindexed_concatenated_df.head(
id | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||||
... | Name | ||||||||||||||||||||||
zip | file name | entry | |||||||||||||||||||||
PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] |
2 rows × 43 columns
A function encompassing all of the above steps in one go
df_with_zip
df_with_zip (history_df:pandas.core.frame.DataFrame, zip_file:str|pathlib.Path, output_file:str|pathlib.Path|None=None)
Extend an existing DataFrame with the data in a zip file
Type | Default | Details | |
---|---|---|---|
history_df | DataFrame | DataFrame to be extended | |
zip_file | str | pathlib.Path | Zip file with new data | |
output_file | str | pathlib.Path | None | None | Output file (optional) |
Returns | None | pandas.core.frame.DataFrame | Extended DataFrame or nothing if an output_file was passed |
For historical reasons, a slightly higher-level function is included
parquet_with_zip
parquet_with_zip (history_file:str|pathlib.Path, zip_file:str|pathlib.Path, output_file:str|pathlib.Path|None=None)
Extend an existing parquet file with the data in a zip file
Type | Default | Details | |
---|---|---|---|
history_file | str | pathlib.Path | DataFrame to be extended | |
zip_file | str | pathlib.Path | Zip file with new data | |
output_file | str | pathlib.Path | None | None | Output file (optional) |
Returns | None | pandas.core.frame.DataFrame | Extended DataFrame or nothing if an output_file was passed |
= parquet_with_zip(history_input_file, zip_file)
merged_df 2) merged_df.head(
[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
id | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||||
... | Name | ||||||||||||||||||||||
zip | file name | entry | |||||||||||||||||||||
PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:01:52.024000+00:00] | [RES] |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | NaT | [2018-01-02 08:02:24.833000+00:00] | [RES] |
2 rows × 43 columns
We should be getting exactly the same result as above
assert reindexed_concatenated_df.equals(merged_df)
= history_directory / 'merged.parquet'
output_file parquet_with_zip(history_input_file, zip_file, output_file)
[('updated', '', '', '', '', '', '', ''), ('ContractFolderStatus', 'ContractFolderStatusCode', '', '', '', '', '', '')]
= pd.read_parquet(output_file)
sample_output_df 2) sample_output_df.head(
id | summary | title | ContractFolderStatus | deleted_on | updated | ContractFolderStatus | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | LocatedContractingParty | ProcurementProject | ... | LocatedContractingParty | TenderResult | LocatedContractingParty | TenderingProcess | LocatedContractingParty | TenderingProcess | LocatedContractingParty | ContractFolderStatusCode | ||||||||||||
Party | Name | TypeCode | BudgetAmount | RequiredCommodityClassification | ... | Party | AwardedTenderedProject | ParentLocatedParty | TenderSubmissionDeadlinePeriod | ParentLocatedParty | TenderSubmissionDeadlinePeriod | BuyerProfileURIID | |||||||||||
PartyName | EstimatedOverallContractAmount | TaxExclusiveAmount | ItemClassificationCode | ... | PartyIdentification | ProcurementProjectLotID | ParentLocatedParty | ParentLocatedParty | Description | ||||||||||||||
Name | ... | ID | ParentLocatedParty | ParentLocatedParty | |||||||||||||||||||
... | PartyName | ParentLocatedParty | |||||||||||||||||||||
... | Name | PartyName | |||||||||||||||||||||
... | Name | ||||||||||||||||||||||
zip | file name | entry | |||||||||||||||||||||
PlataformasAgregadasSinMenores_2018.zip | PlataformasAgregadasSinMenores_20180217_180137_1.atom | 453 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1284/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 1284/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 3001 Renedo de Esgu... | 3.0 | 89917.95 | 89917.95 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | None | [2018-01-02 08:01:52.024000+00:00] | [RES] |
452 | https://contrataciondelestado.es/sindicacion/P... | Expediente: 1282/17, Entidad: Diputación Provi... | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 1282/17 | Diputación Provincial de Valladolid | Refuerzo de Firme en la VP 6603 Mota del Marqu... | 3.0 | 175708.46 | 175708.46 | [45233142.0] | ... | L02000047 | [1.0] | <NA> | 2017-11-02 23:59:00+00:00 | <NA> | <NA> | <NA> | None | [2018-01-02 08:02:24.833000+00:00] | [RES] |
2 rows × 43 columns
Let us make sure nothing was lost
assert sample_output_df.shape == merged_df.shape