import sproc.xmlpostprocess
The modules needed to read the data are imported only here, for testing, in order to avoid circular dependencies in the resulting Python modules.
Sample data
Directory where the data (XML files) are stored
directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')
A (sample) file in that directory
xml_file = directory / 'PlataformasAgregadasSinMenores_20220104_030016_1.atom'
assert xml_file.exists()
xml_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_20220104_030016_1.atom')
df = sproc.xml.to_df(xml_file)
df
| | id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - LegalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: C. 2-2021; Órgano de Contrataci... | L'objecte del contracte és la renovació de tot... | 2022-01-03T01:11:41.826+01:00 | C. 2-2021 | ADJ | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Sant Ramon | Entitats municipals de Catalunya | L'objecte del contracte és la renovació de tot... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8128_3/2021; Órgano de Contrata... | Obras de restauración hidromorfológica del río... | 2022-01-03T01:00:11.194+01:00 | 8128_3/2021 | PUB | NaN | Pleno del Ayuntamiento | AYUNTAMIENTO DE MONREAL | Obras de restauración hidromorfológica del río... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1000_0005-CP01-2021-000063; Órg... | Contrato del servicio de realización de labore... | 2022-01-03T01:00:10.399+01:00 | 1000_0005-CP01-2021-000063 | EV | NaN | El Director General de Comunicación y Relacion... | Departamento de Presidencia, Igualdad, Función... | Contrato del servicio de realización de labore... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1379/2020 4738; Órgano de Contr... | Obres de renovació de l'enllumenat públic a la... | 2022-01-03T00:11:40.740+01:00 | 1379/2020 4738 | EV | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Canet de Mar | Entitats municipals de Catalunya | Obres de renovació de l'enllumenat públic a la... | ... | https://contractaciopublica.gencat.cat/ecofin_... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2021-44; Órgano de Contratación... | Subministre i la instal·lació fotovoltaica en ... | 2022-01-03T00:11:40.696+01:00 | 2021-44 | EV | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Valls | Entitats municipals de Catalunya | Subministre i la instal·lació fotovoltaica en ... | ... | https://contractaciopublica.gencat.cat/ecofin_... | Enllac plec clausules tecniques.doc | https://contractaciopublica.gencat.cat/ecofin_... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 112 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1005_391-2021; Órgano de Contra... | Apoyo a la gestión del patrimonio filmográfico... | 2021-12-31T01:00:14.946+01:00 | 1005_391-2021 | PUB | NaN | Dirección General de Cultura-Institución Prínc... | Departamento de Cultura, Deporte y Juventud | Apoyo a la gestión del patrimonio filmográfico... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 113 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8165_3/2021; Órgano de Contrata... | Asistencia técnica para la prestación del serv... | 2021-12-31T01:00:14.393+01:00 | 8165_3/2021 | EV | NaN | Mancomunidad de Servicios Sociales de Base de ... | MANCOMUNIDAD DE SERVICIOS DE HUARTE Y DE ESTER... | Asistencia técnica para la prestación del serv... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 114 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8113_3/2021; Órgano de Contrata... | Contrato de servicios de desinfección, desinse... | 2021-12-31T01:00:13.594+01:00 | 8113_3/2021 | EV | NaN | Subdirector de Gestión y Recursos | Agencia Navarra para la Dependencia | Contrato de servicios de desinfección, desinse... | ... | NaN | NaN | NaN | 2022-01-01 | 2022-12-31 | NaN | NaN | NaN | NaN | NaN |
| 115 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8113_01 2021; Órgano de Contrat... | Contrato del Servicio de Teleasistencia para l... | 2021-12-31T01:00:12.604+01:00 | 8113_01 2021 | EV | NaN | Agencia Navarra de Autonomía y Desarrollo de l... | Agencia Navarra para la Dependencia | Contrato del Servicio de Teleasistencia para l... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 116 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 0001264/2021; Órgano de contrat... | 2021/pa-44-4 servicio de mantenimiento integra... | 2021-12-31T00:14:15.739+01:00 | 0001264/2021 | RES | NaN | Agencia Pública Empresarial Sanitaria Bajo Gua... | Junta de Andalucía | 2021/pa-44-4 servicio de mantenimiento integra... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
117 rows × 38 columns
There are some multivalued columns
sproc.structure.multivalued_columns(df)
['ContractFolderStatus - ProcurementProject - RequiredCommodityClassification - ItemClassificationCode',
'ContractFolderStatus - TenderResult - ResultCode',
'ContractFolderStatus - TenderResult - ReceivedTenderQuantity',
'ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID',
'ContractFolderStatus - TenderResult - WinningParty - PartyName - Name',
'ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount',
'ContractFolderStatus - ValidNoticeInfo - NoticeTypeCode',
'ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - PublicationMediaName',
'ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate',
'ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID']
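How `sproc.structure.multivalued_columns` decides that a column is multivalued is not shown here. As a rough sketch, assuming that a multivalued cell holds a list (or tuple), such columns could be detected as follows. The function name `list_valued_columns` and the toy data are hypothetical.

```python
import pandas as pd


def list_valued_columns(df: pd.DataFrame) -> list:
    """Names of the columns in which at least one cell holds a list or tuple."""

    def is_sequence(value) -> bool:
        return isinstance(value, (list, tuple))

    return [name for name in df.columns if df[name].map(is_sequence).any()]


# Toy data (made up): the first tender has two winners, the second none
toy = pd.DataFrame({
    'title': ['first tender', 'second tender'],
    'winner': [['A', 'B'], None],
})
found = list_valued_columns(toy)
```

Here `found` only contains `'winner'`, since no cell in `title` is a list.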
Columns’ types
A regular expression matching any column whose final component is PostalZone
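The definition of `re_postal_zone` is not displayed on this page. A minimal sketch of such a pattern, assuming the components of a column name are joined with `' - '` (as in the column names above), could look like this. The name `re_postal_zone_sketch` is hypothetical.

```python
import re

# Anything, followed by the separator and a final 'PostalZone' component
re_postal_zone_sketch = re.compile(r'.* - PostalZone$')

assert re_postal_zone_sketch.match('foo - PostalZone')
assert not re_postal_zone_sketch.match('Address - Number')
```

The trailing `$` makes sure that `PostalZone` is the last component and not an intermediate one.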
For instance
re_postal_zone.match('foo - PostalZone')
<re.Match object; span=(0, 16), match='foo - PostalZone'>
but
re_postal_zone.match('Address - Number')
Two positives
assert re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'ProcurementProject', 'RealizedLocation', 'Address', 'PostalZone']))
assert re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'LocatedContractingParty', 'Party', 'PostalAddress', 'PostalZone']))
A negative
assert not re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'Wap']))
The list below specifies fields that are to be interpreted as str
str_columns[:2]
[]
For easier processing, the lists are turned into str
assembled_str_columns
[]
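Turning a list of components into a single `str` presumably amounts to joining them with the separator seen in the column names. The following is only a sketch of that idea, not the actual `sproc.structure.assemble_name`; the helper `assemble_name_sketch` and the sample component lists are assumptions.

```python
def assemble_name_sketch(components, separator=' - '):
    """Join the components of a (nested) field into a flat column name."""
    return separator.join(components)


# Hypothetical component lists, mimicking the column names seen above
sample_columns = [
    ['ContractFolderStatus', 'ContractFolderID'],
    ['ContractFolderStatus', 'ProcurementProject', 'RealizedLocation', 'Address', 'PostalZone'],
]
assembled = [assemble_name_sketch(components) for components in sample_columns]
```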
These columns, if present, are parsed jointly into a new column
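As an illustration of parsing two columns jointly, a date column and a time column can be concatenated and converted into a single timestamp column. This is a sketch under assumptions: the short column names and the toy values are made up, and the times are simply taken to be UTC, which need not be what the library does.

```python
import pandas as pd

# Toy data (made up): the second row has neither date nor time
toy = pd.DataFrame({
    'EndDate': ['2021-12-17', None],
    'EndTime': ['15:00:00', None],
})

# Concatenate date and time; a missing part yields a missing result
date_and_time = toy['EndDate'].str.cat(toy['EndTime'], sep=' ')

# Parse into timezone-aware timestamps (assumption: values are in UTC)
toy['Deadline'] = pd.to_datetime(date_and_time, utc=True)
combined = toy['Deadline']
```

Rows in which either part is missing end up as `NaT` in the new column.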
The column indicating the status is also going to be exploited
A function to tidy up some things in the pd.DataFrames returned by sproc.xml.to_df
typecast_columns
typecast_columns (input_df:pandas.core.frame.DataFrame)
Tidy up the pd.DataFrame returned by to_df
| | Type | Details |
|---|---|---|
| input_df | DataFrame | Input DataFrame as read by to_df |
| Returns | DataFrame | Post-processed DataFrame |
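The body of `typecast_columns` is not shown on this page. The kind of tidying it performs can be sketched as below, assuming two steps only: columns whose name matches a pattern are cast to the pandas `string` dtype, and `updated` is parsed into UTC timestamps. The function `typecast_sketch` and the toy data are hypothetical, and the real function does more (for instance, it wraps `updated` in lists).

```python
import re

import pandas as pd


def typecast_sketch(input_df: pd.DataFrame, str_pattern: re.Pattern) -> pd.DataFrame:
    """Cast the columns matching `str_pattern` to string and parse `updated`."""
    output_df = input_df.copy()
    for name in output_df.columns:
        if str_pattern.match(name):
            output_df[name] = output_df[name].astype('string')
    # Timestamps with an offset are converted to UTC
    output_df['updated'] = pd.to_datetime(output_df['updated'], utc=True)
    return output_df


toy = pd.DataFrame({
    'updated': ['2022-01-03T01:11:41.826+01:00', '2022-01-03T01:00:11.194+01:00'],
    'Address - PostalZone': ['08001', None],
})
typed = typecast_sketch(toy, re.compile(r'.* - PostalZone$'))
```

Casting postal codes to `string` keeps leading zeros, which would be lost if the column were read as a number.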
post_df = typecast_columns(df)
post_df.head()
| | id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: C. 2-2021; Órgano de Contrataci... | L'objecte del contracte és la renovació de tot... | [2022-01-03 00:11:41.826000+00:00] | C. 2-2021 | [ADJ] | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Sant Ramon | Entitats municipals de Catalunya | L'objecte del contracte és la renovació de tot... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2021-12-17 14:00:00+00:00 |
| 1 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8128_3/2021; Órgano de Contrata... | Obras de restauración hidromorfológica del río... | [2022-01-03 00:00:11.194000+00:00] | 8128_3/2021 | [PUB] | <NA> | Pleno del Ayuntamiento | AYUNTAMIENTO DE MONREAL | Obras de restauración hidromorfológica del río... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-22 23:30:00+00:00 |
| 2 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1000_0005-CP01-2021-000063; Órg... | Contrato del servicio de realización de labore... | [2022-01-03 00:00:10.399000+00:00] | 1000_0005-CP01-2021-000063 | [EV] | <NA> | El Director General de Comunicación y Relacion... | Departamento de Presidencia, Igualdad, Función... | Contrato del servicio de realización de labore... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
| 3 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1379/2020 4738; Órgano de Contr... | Obres de renovació de l'enllumenat públic a la... | [2022-01-02 23:11:40.740000+00:00] | 1379/2020 4738 | [EV] | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Canet de Mar | Entitats municipals de Catalunya | Obres de renovació de l'enllumenat públic a la... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
| 4 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2021-44; Órgano de Contratación... | Subministre i la instal·lació fotovoltaica en ... | [2022-01-02 23:11:40.696000+00:00] | 2021-44 | [EV] | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Valls | Entitats municipals de Catalunya | Subministre i la instal·lació fotovoltaica en ... | ... | Enllac plec clausules tecniques.doc | https://contractaciopublica.gencat.cat/ecofin_... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
5 rows × 39 columns
After post-processing some objects have been parsed as dates.
Multivalued columns are still multivalued, but there are now two extra ones (updated and ContractFolderStatus - ContractFolderStatusCode)
# assert sproc.structure.multivalued_columns(post_df) == sproc.structure.multivalued_columns(df) # <---------------------
assert len(sproc.structure.multivalued_columns(post_df)) == len(sproc.structure.multivalued_columns(df)) + 2
Most recent update
After post-processing, updated is a column of dtype object (each cell holds a list), so it cannot be ordered directly
post_df['updated'].dtype
dtype('O')
We sort the entries by updated date (ascending order) and then group by id. Notice that, as a step prior to sorting, the maximum of all the elements in each updated cell is computed.
grouped = post_df.sort_values('updated', key=np.vectorize(max)).groupby('id')
We are interested in groups with more than one element
not_one_element_group = (grouped.size() > 1)
not_one_element_group
id
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
...
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
Length: 115, dtype: bool
Number of tenders with updates
not_one_element_group.sum()
2
actual_groups = not_one_element_group.index[not_one_element_group]
actual_groups
Index(['https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8904280', 'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8994209'], dtype='string', name='id')
The largest group has this number of elements
grouped.size().max()
2
To get the groups with exactly a given number of elements (here, 2 again)
size_2_groups = (grouped.size() == 2).loc[lambda x: x].index
size_2_groups
Index(['https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8904280', 'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8994209'], dtype='string', name='id')
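The steps above (sorting on the maximum of a list-valued column, grouping by id and selecting the groups with more than one entry) can be reproduced on a toy DataFrame. This is only a sketch: the data are made up, and `Series.map(max)` is used as the sorting key in place of `np.vectorize(max)`.

```python
import pandas as pd

# Toy data (made up): every `updated` cell is a list of timestamps
toy = pd.DataFrame({
    'id': ['a', 'b', 'a', 'c'],
    'updated': [
        [pd.Timestamp('2022-01-01 07:30', tz='UTC')],
        [pd.Timestamp('2021-12-31 10:00', tz='UTC')],
        [pd.Timestamp('2022-01-01 00:00', tz='UTC')],
        [pd.Timestamp('2022-01-02 12:00', tz='UTC')],
    ],
})

# Lists cannot be ordered directly, so the maximum of every list is the sorting key
ordered = toy.sort_values('updated', key=lambda column: column.map(max))
grouped_toy = ordered.groupby('id')

# ids showing up more than once, i.e., tenders with updates
sizes = grouped_toy.size()
updated_ids = sizes[sizes > 1].index.tolist()

# Since rows were sorted beforehand, the last row of each group is the most recent one
latest = grouped_toy.last()
```

Only `'a'` appears twice, and the row kept for it by `last()` is the one updated at 07:30.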
first_group = grouped.get_group(actual_groups[0])
first_group
| | id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 00:00:16.761000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
| 19 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 07:30:09.698000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 39 columns
True if columns are different across elements
columns_are_different = first_group.iloc[0] != first_group.iloc[-1]
columns_are_different
id False
summary False
title False
updated True
ContractFolderStatus - ContractFolderID False
ContractFolderStatus - ContractFolderStatusCode False
ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID True
ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name False
ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name False
ContractFolderStatus - ProcurementProject - Name False
ContractFolderStatus - ProcurementProject - TypeCode False
ContractFolderStatus - ProcurementProject - BudgetAmount - EstimatedOverallContractAmount False
ContractFolderStatus - ProcurementProject - BudgetAmount - TaxExclusiveAmount False
ContractFolderStatus - ProcurementProject - RequiredCommodityClassification - ItemClassificationCode False
ContractFolderStatus - ProcurementProject - RealizedLocation - CountrySubentityCode False
ContractFolderStatus - ProcurementProject - PlannedPeriod - DurationMeasure False
ContractFolderStatus - TenderResult - ResultCode True
ContractFolderStatus - TenderResult - ReceivedTenderQuantity True
ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID True
ContractFolderStatus - TenderResult - WinningParty - PartyName - Name True
ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount True
ContractFolderStatus - TenderingProcess - ProcedureCode False
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndDate True
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndTime True
ContractFolderStatus - ValidNoticeInfo - NoticeTypeCode False
ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - PublicationMediaName False
ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate True
ContractFolderStatus - LegalDocumentReference - ID True
ContractFolderStatus - LegalDocumentReference - Attachment - ExternalReference - URI True
ContractFolderStatus - TechnicalDocumentReference - ID True
ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI True
ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate True
ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate True
ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID True
ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name True
ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate True
ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime True
ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID True
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod True
dtype: bool
Apart from updated, the values in those columns are almost all missing (NaN, <NA> or NaT)
first_group[first_group.columns[columns_are_different]]
| | updated | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - TenderResult - ResultCode | ContractFolderStatus - TenderResult - ReceivedTenderQuantity | ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID | ContractFolderStatus - TenderResult - WinningParty - PartyName - Name | ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndDate | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndTime | ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | [2022-01-01 00:00:16.761000+00:00] | <NA> | NaN | NaN | NaN | NaN | NaN | <NA> | <NA> | [[2021-12-16, 2021-12-16, 2021-12-16, 2021-12-... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
| 19 | [2022-01-01 07:30:09.698000+00:00] | <NA> | NaN | NaN | NaN | NaN | NaN | <NA> | <NA> | [[2021-12-16, 2021-12-16, 2021-12-16, 2021-12-... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 22 columns
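Columns holding only missing values are flagged as different in the comparison above because NaN never compares equal to itself. A minimal sketch with float NaN on made-up data shows the effect, along with one possible way of ignoring positions that are missing on both sides.

```python
import numpy as np
import pandas as pd

# Two toy "rows" (made up) that agree everywhere, including a missing amount
first_row = pd.Series({'title': 'x', 'amount': np.nan})
last_row = pd.Series({'title': 'x', 'amount': np.nan})

# Elementwise comparison: NaN != NaN evaluates to True
naive_difference = first_row != last_row

# Positions missing on both sides are not counted as differences
both_missing = first_row.isna() & last_row.isna()
actual_difference = naive_difference & ~both_missing
```

With this correction no column of the toy example is reported as different.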
Only the last element of each group (the most recent entry) is kept
only_last_update_df = grouped.last()
only_last_update_df.head()
| id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ContractFolderStatus - ProcurementProject - TypeCode | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://contrataciondelestado.es/sindicacion/P... | Id licitación: 6011900114;Órgano de Contrataci... | Servicio de migración de los productos BMC ins... | [2021-12-31 10:56:15.855000+00:00] | 6011900114 | [RES] | http://www.madrid.org/cs/Satellite?cid=1204201... | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Servicio de migración de los productos BMC ins... | 2.0 | ... | 1354764548025.pdf | http://www.madrid.org/contratos-publicos/13547... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
| https://contrataciondelestado.es/sindicacion/P... | Id licitación: 11/2019; Ó“rgano de Contratació... | Concurso de proyectos para la rehabilitación u... | [2021-12-31 10:41:20.201000+00:00] | 11/2019 | [EV] | <NA> | Alcaldia | Ayuntamiento de Leioa | Concurso de proyectos para la rehabilitación u... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | Ayuntamiento de Leioa | <NA> | <NA> | None | 2020-01-23 17:00:00+00:00 |
| https://contrataciondelestado.es/sindicacion/P... | Id licitación: 6012000208;Órgano de Contrataci... | Suministro de equipos de manutención (carretil... | [2021-12-31 11:11:15.851000+00:00] | 6012000208 | [RES] | http://www.madrid.org/cs/Satellite?cid=1204201... | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Suministro de equipos de manutención (carretil... | 1.0 | ... | 1354836373815.pdf | http://www.madrid.org/contratos-publicos/13548... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
| https://contrataciondelestado.es/sindicacion/P... | Id licitación: 6012100119;Órgano de Contrataci... | Servicio de soporte al mantenimiento de instal... | [2021-12-31 10:11:16.027000+00:00] | 6012100119 | [RES] | http://www.madrid.org/cs/Satellite?cid=1204201... | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Servicio de soporte al mantenimiento de instal... | 2.0 | ... | 1354874714036.pdf | http://www.madrid.org/contratos-publicos/13548... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
| https://contrataciondelestado.es/sindicacion/P... | Id Licitación: PcPG/2021/801758, Órgano de Con... | Servicio para la caracterización hidromorfolóx... | [2022-01-02 22:11:15.096000+00:00] | PcPG/2021/801758 | [RES] | https://www.contratosdegalicia.gal//consultaOr... | Ente Público Empresarial Augas de Galicia | <NA> | Servicio para la caracterización hidromorfolóx... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | A12024974 | <NA> | <NA> | <NA> | None | 2021-06-03 14:00:00+00:00 |
5 rows × 38 columns
The first group with more than one element
a_group_df = grouped.get_group(size_2_groups[0])
a_group_df
| | id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 00:00:16.761000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
| 19 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 07:30:09.698000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 39 columns
The columns whose history is to be kept
For the sake of efficiency
A list (of lists) with the columns that must be kept when keeping only updates
A function which keeps, for every procurement entry, only the last update.
keep_updates_only
keep_updates_only (df:pandas.core.frame.DataFrame)
Keep only the last update for every collection of entries with the same id
| | Type | Details |
|---|---|---|
| df | DataFrame | Input |
| Returns | DataFrame | Output |
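The implementation of `keep_updates_only` is not displayed here. One way to obtain a comparable result is sketched below, assuming that regular columns take their most recent value while the "historical" columns collect every value in chronological order. The function `keep_last_with_history`, its parameters and the toy data are all hypothetical, and `updated` is a plain timestamp column in this sketch rather than a column of lists.

```python
import pandas as pd


def keep_last_with_history(df, history_cols, id_col='id', order_col='updated'):
    """Keep one row per id: last value of regular columns, full history of the others."""
    grouped = df.sort_values(order_col).groupby(id_col)
    # Most recent value of every column that is not part of the history
    last = grouped.last().drop(columns=history_cols)
    # Chronological list of values for the columns whose history is kept
    history = pd.DataFrame({name: grouped[name].agg(list) for name in history_cols})
    return last.join(history).reset_index()


toy = pd.DataFrame({
    'id': ['a', 'b', 'a'],
    'title': ['old', 'other', 'new'],
    'updated': pd.to_datetime(
        ['2022-01-01 00:00', '2021-12-31 10:00', '2022-01-01 07:30'], utc=True),
    'status': ['PUB', 'EV', 'EV'],
})
result = keep_last_with_history(toy, ['updated', 'status'])
```

For id `'a'` the title kept is the most recent one, whereas `status` records both states in order.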
only_updated_df = keep_updates_only(post_df)
only_updated_df.head()
| | id | index | summary | title | ContractFolderStatus - ContractFolderID | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ContractFolderStatus - ProcurementProject - TypeCode | ... | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | updated | ContractFolderStatus - ContractFolderStatusCode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://contrataciondelestado.es/sindicacion/P... | 116 | Id licitación: 0001264/2021; Órgano de contrat... | 2021/pa-44-4 servicio de mantenimiento integra... | 0001264/2021 | <NA> | Agencia Pública Empresarial Sanitaria Bajo Gua... | Junta de Andalucía | 2021/pa-44-4 servicio de mantenimiento integra... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2021-11-11 15:00:00+00:00 | [2021-12-30 23:14:15.739000+00:00] | [RES] |
| 1 | https://contrataciondelestado.es/sindicacion/P... | 115 | Id licitación: 8113_01 2021; Órgano de Contrat... | Contrato del Servicio de Teleasistencia para l... | 8113_01 2021 | <NA> | Agencia Navarra de Autonomía y Desarrollo de l... | Agencia Navarra para la Dependencia | Contrato del Servicio de Teleasistencia para l... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:12.604000+00:00] | [EV] |
| 2 | https://contrataciondelestado.es/sindicacion/P... | 114 | Id licitación: 8113_3/2021; Órgano de Contrata... | Contrato de servicios de desinfección, desinse... | 8113_3/2021 | <NA> | Subdirector de Gestión y Recursos | Agencia Navarra para la Dependencia | Contrato de servicios de desinfección, desinse... | 2.0 | ... | 2022-01-01 | 2022-12-31 | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:13.594000+00:00] | [EV] |
| 3 | https://contrataciondelestado.es/sindicacion/P... | 113 | Id licitación: 8165_3/2021; Órgano de Contrata... | Asistencia técnica para la prestación del serv... | 8165_3/2021 | <NA> | Mancomunidad de Servicios Sociales de Base de ... | MANCOMUNIDAD DE SERVICIOS DE HUARTE Y DE ESTER... | Asistencia técnica para la prestación del serv... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:14.393000+00:00] | [EV] |
| 4 | https://contrataciondelestado.es/sindicacion/P... | 112 | Id licitación: 1005_391-2021; Órgano de Contra... | Apoyo a la gestión del patrimonio filmográfico... | 1005_391-2021 | <NA> | Dirección General de Cultura-Institución Prínc... | Departamento de Cultura, Deporte y Juventud | Apoyo a la gestión del patrimonio filmográfico... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-14 23:59:00+00:00 | [2021-12-31 00:00:14.946000+00:00] | [PUB] |
5 rows × 40 columns
len(only_updated_df)
115
only_updated_df.index
RangeIndex(start=0, stop=115, step=1)
One of the groups (now an individual row) that had more than one row
only_updated_df[only_updated_df['id'] == size_2_groups[0]][assembled_historical_cols]
| | updated | ContractFolderStatus - ContractFolderStatusCode |
|---|---|---|
| 95 | [2022-01-01 00:00:16.761000+00:00, 2022-01-01 ... | [EV, EV] |
Deleted series
import sproc.bundle
zip_file = directory / 'yearly' / 'PlataformasAgregadasSinMenores_2018.zip'
assert zip_file.exists()
zip_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/yearly/PlataformasAgregadasSinMenores_2018.zip')
deleted_series = sproc.bundle.read_deleted_zip(zip_file)
deleted_series.head(2)
zip file name id
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1985903 2018-01-04 13:11:18.021000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1969197 2018-01-04 13:11:17.921000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
No duplicates (at level 2, i.e., id)
deleted_series.index.get_level_values(2).duplicated().any()
False
We add an artificial one
new_data = pd.Timestamp(year=2017, month=1, day=3, tz=deleted_series.iloc[-1].tz)
deleted_series.loc['foo', deleted_series.index[-1][1], deleted_series.index[-1][2]] = new_data
Now we have a duplicate (at level 2, i.e., id)
deleted_series.index.get_level_values(2).duplicated().any()
True
A function to get rid of duplicates from a deleted series. The oldest entry is kept.
deduplicate_deleted_series
deduplicate_deleted_series (series:pandas.core.series.Series)
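The idea of keeping the oldest entry for every id can be sketched as follows. This is not the actual implementation of `deduplicate_deleted_series`; the helper `keep_oldest_per_id`, the level name passed to it and the toy series are assumptions made for illustration.

```python
import pandas as pd


def keep_oldest_per_id(series: pd.Series, id_level: str = 'id') -> pd.Series:
    """Drop entries whose id was already seen at an earlier date."""
    ordered = series.sort_values()
    # After sorting by date, the first occurrence of every id is the oldest one
    is_repeat = ordered.index.get_level_values(id_level).duplicated(keep='first')
    return ordered[~is_repeat]


# Toy series (made up) in which 'id/2' shows up twice
index = pd.MultiIndex.from_tuples(
    [('2018.zip', 'f1.atom', 'id/1'),
     ('2018.zip', 'f1.atom', 'id/2'),
     ('foo', 'f1.atom', 'id/2')],
    names=['zip', 'file name', 'id'])
toy = pd.Series(
    pd.to_datetime(['2018-01-04', '2018-02-07', '2017-01-03'], utc=True),
    index=index, name='deleted_on')
deduplicated = keep_oldest_per_id(toy)
```

Notice that, in this sketch, the result is ordered by date rather than in the original order of the entries.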
deduplicated_series = deduplicate_deleted_series(deleted_series)
deduplicated_series.tail(2)
zip file name id
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_190110_1.atom https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1996094 2018-02-07 09:01:16.378000+00:00
foo PlataformasAgregadasSinMenores_20180217_190110_1.atom https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/2000163 2017-01-03 00:00:00+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
assert not deduplicated_series.index.get_level_values(2).duplicated().any()
deleted_series.shape, deduplicated_series.shape
((52,), (51,))