import sproc.xml
This import is kept out of the resulting Python modules, both to avoid circular dependencies and because it is only needed for testing.
Sample data
Directory where the data (XML files) are stored
directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
A (sample) file in that directory
xml_file = directory / 'PlataformasAgregadasSinMenores_20220104_030016_1.atom'
assert xml_file.exists()
df = sproc.xml.to_df(xml_file)
df
id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - LegalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | |
0 | | Id licitación: C. 2-2021; Órgano de Contrataci... | L'objecte del contracte és la renovació de tot... | 2022-01-03T01:11:41.826+01:00 | C. 2-2021 | ADJ | | Ajuntament de Sant Ramon | Entitats municipals de Catalunya | L'objecte del contracte és la renovació de tot... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | | Id licitación: 8128_3/2021; Órgano de Contrata... | Obras de restauración hidromorfológica del río... | 2022-01-03T01:00:11.194+01:00 | 8128_3/2021 | PUB | NaN | Pleno del Ayuntamiento | AYUNTAMIENTO DE MONREAL | Obras de restauración hidromorfológica del río... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | | Id licitación: 1000_0005-CP01-2021-000063; Órg... | Contrato del servicio de realización de labore... | 2022-01-03T01:00:10.399+01:00 | 1000_0005-CP01-2021-000063 | EV | NaN | El Director General de Comunicación y Relacion... | Departamento de Presidencia, Igualdad, Función... | Contrato del servicio de realización de labore... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | | Id licitación: 1379/2020 4738; Órgano de Contr... | Obres de renovació de l'enllumenat públic a la... | 2022-01-03T00:11:40.740+01:00 | 1379/2020 4738 | EV | | Ajuntament de Canet de Mar | Entitats municipals de Catalunya | Obres de renovació de l'enllumenat públic a la... | ... | | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | | Id licitación: 2021-44; Órgano de Contratación... | Subministre i la instal·lació fotovoltaica en ... | 2022-01-03T00:11:40.696+01:00 | 2021-44 | EV | | Ajuntament de Valls | Entitats municipals de Catalunya | Subministre i la instal·lació fotovoltaica en ... | ... | | Enllac plec clausules tecniques.doc | | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112 | | Id licitación: 1005_391-2021; Órgano de Contra... | Apoyo a la gestión del patrimonio filmográfico... | 2021-12-31T01:00:14.946+01:00 | 1005_391-2021 | PUB | NaN | Dirección General de Cultura-Institución Prínc... | Departamento de Cultura, Deporte y Juventud | Apoyo a la gestión del patrimonio filmográfico... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
113 | | Id licitación: 8165_3/2021; Órgano de Contrata... | Asistencia técnica para la prestación del serv... | 2021-12-31T01:00:14.393+01:00 | 8165_3/2021 | EV | NaN | Mancomunidad de Servicios Sociales de Base de ... | MANCOMUNIDAD DE SERVICIOS DE HUARTE Y DE ESTER... | Asistencia técnica para la prestación del serv... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
114 | | Id licitación: 8113_3/2021; Órgano de Contrata... | Contrato de servicios de desinfección, desinse... | 2021-12-31T01:00:13.594+01:00 | 8113_3/2021 | EV | NaN | Subdirector de Gestión y Recursos | Agencia Navarra para la Dependencia | Contrato de servicios de desinfección, desinse... | ... | NaN | NaN | NaN | 2022-01-01 | 2022-12-31 | NaN | NaN | NaN | NaN | NaN |
115 | | Id licitación: 8113_01 2021; Órgano de Contrat... | Contrato del Servicio de Teleasistencia para l... | 2021-12-31T01:00:12.604+01:00 | 8113_01 2021 | EV | NaN | Agencia Navarra de Autonomía y Desarrollo de l... | Agencia Navarra para la Dependencia | Contrato del Servicio de Teleasistencia para l... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
116 | | Id licitación: 0001264/2021; Órgano de contrat... | 2021/pa-44-4 servicio de mantenimiento integra... | 2021-12-31T00:14:15.739+01:00 | 0001264/2021 | RES | NaN | Agencia Pública Empresarial Sanitaria Bajo Gua... | Junta de Andalucía | 2021/pa-44-4 servicio de mantenimiento integra... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
117 rows × 38 columns
There are some multivalued columns
['ContractFolderStatus - ProcurementProject - RequiredCommodityClassification - ItemClassificationCode',
'ContractFolderStatus - TenderResult - ResultCode',
'ContractFolderStatus - TenderResult - ReceivedTenderQuantity',
'ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID',
'ContractFolderStatus - TenderResult - WinningParty - PartyName - Name',
'ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount',
'ContractFolderStatus - ValidNoticeInfo - NoticeTypeCode',
'ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - PublicationMediaName',
'ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate',
'ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID']
Columns’ types
A regular expression matching any column whose final component is PostalZone
For instance
re_postal_zone.match('foo - PostalZone')
<re.Match object; span=(0, 16), match='foo - PostalZone'>
re_postal_zone.match('Address - Number')
Two positives
assert re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'ProcurementProject', 'RealizedLocation', 'Address', 'PostalZone']))
assert re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'LocatedContractingParty', 'Party', 'PostalAddress', 'PostalZone']))
A negative
assert not re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'Wap']))
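The definition of re_postal_zone is not shown on this page. As a minimal sketch of how such a pattern could look, assuming column names are path components joined by ' - ' (as the column names above suggest), the following uses the hypothetical names postal_zone_pattern and join_components; the latter is only an illustrative stand-in for sproc.structure.assemble_name, not its actual implementation.

```python
import re

# Assumption: column names are path components joined by ' - '.
SEPARATOR = ' - '

# Hypothetical pattern: any prefix, then the separator, then 'PostalZone' at the very end.
postal_zone_pattern = re.compile(r'.+' + re.escape(SEPARATOR) + r'PostalZone$')


def join_components(components):
    """Illustrative stand-in for sproc.structure.assemble_name (assumed behaviour)."""
    return SEPARATOR.join(components)


positive = postal_zone_pattern.match(
    join_components(['ContractFolderStatus', 'ProcurementProject',
                     'RealizedLocation', 'Address', 'PostalZone']))
negative = postal_zone_pattern.match(join_components(['ContractFolderStatus', 'Wap']))
# PostalZone in the middle of the name must not match either
middle = postal_zone_pattern.match(join_components(['Address', 'PostalZone', 'Other']))
```

Anchoring the pattern with `$` is what restricts the match to names whose final component is PostalZone.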
The list below specifies fields that are to be interpreted as str
str_columns[:2]
For easier processing, the lists are turned into str
These columns, if present, are parsed jointly into a new column
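As an illustration of what parsing two columns jointly can mean, the sketch below combines a date column and a time column into a single timezone-aware datetime column, leaving the frame untouched when either column is absent. The function name combine_date_and_time and the toy column names are hypothetical; this is not the code used by typecast_columns, just one way to do it with pandas.

```python
import pandas as pd


def combine_date_and_time(df, date_col, time_col, new_col):
    """Return a copy of df with new_col built from date_col and time_col (if both exist)."""
    out = df.copy()
    if date_col in out.columns and time_col in out.columns:
        # string concatenation propagates missing values, which then become NaT
        text = out[date_col].astype('string') + ' ' + out[time_col].astype('string')
        out[new_col] = pd.to_datetime(text, errors='coerce', utc=True)
    return out


toy = pd.DataFrame({
    'date': ['2021-12-17', None],
    'time': ['14:00:00', None],
})
combined = combine_date_and_time(toy, 'date', 'time', 'deadline')
# a frame lacking the time column is returned unchanged
untouched = combine_date_and_time(toy[['date']], 'date', 'time', 'deadline')
```

The original columns are kept alongside the new one, which is consistent with the column count growing by one after post-processing.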
The column indicating the status will also be used
A function to tidy up some things in the pd.DataFrames returned by sproc.xml.to_df
typecast_columns (input_df:pandas.core.frame.DataFrame)
Tidy up the pd.DataFrame returned by to_df
Type | Details | |
input_df | DataFrame | Input DataFrame as read by to_df |
Returns | DataFrame | Post-processed DataFrame |
post_df = typecast_columns(df)
post_df.head()
id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
0 | | Id licitación: C. 2-2021; Órgano de Contrataci... | L'objecte del contracte és la renovació de tot... | [2022-01-03 00:11:41.826000+00:00] | C. 2-2021 | [ADJ] | | Ajuntament de Sant Ramon | Entitats municipals de Catalunya | L'objecte del contracte és la renovació de tot... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2021-12-17 14:00:00+00:00 |
1 | | Id licitación: 8128_3/2021; Órgano de Contrata... | Obras de restauración hidromorfológica del río... | [2022-01-03 00:00:11.194000+00:00] | 8128_3/2021 | [PUB] | <NA> | Pleno del Ayuntamiento | AYUNTAMIENTO DE MONREAL | Obras de restauración hidromorfológica del río... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-22 23:30:00+00:00 |
2 | | Id licitación: 1000_0005-CP01-2021-000063; Órg... | Contrato del servicio de realización de labore... | [2022-01-03 00:00:10.399000+00:00] | 1000_0005-CP01-2021-000063 | [EV] | <NA> | El Director General de Comunicación y Relacion... | Departamento de Presidencia, Igualdad, Función... | Contrato del servicio de realización de labore... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
3 | | Id licitación: 1379/2020 4738; Órgano de Contr... | Obres de renovació de l'enllumenat públic a la... | [2022-01-02 23:11:40.740000+00:00] | 1379/2020 4738 | [EV] | | Ajuntament de Canet de Mar | Entitats municipals de Catalunya | Obres de renovació de l'enllumenat públic a la... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
4 | | Id licitación: 2021-44; Órgano de Contratación... | Subministre i la instal·lació fotovoltaica en ... | [2022-01-02 23:11:40.696000+00:00] | 2021-44 | [EV] | | Ajuntament de Valls | Entitats municipals de Catalunya | Subministre i la instal·lació fotovoltaica en ... | ... | Enllac plec clausules tecniques.doc | | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
5 rows × 39 columns
After post-processing, some object columns have been parsed as dates.
Multivalued columns are still multivalued, but there are now two extra ones (updated and ContractFolderStatus - ContractFolderStatusCode)
# assert sproc.structure.multivalued_columns(post_df) == sproc.structure.multivalued_columns(df) # <---------------------
assert len(sproc.structure.multivalued_columns(post_df)) == len(sproc.structure.multivalued_columns(df)) + 2
Most recent update
After post-processing, updated is an object column (every element is a list), so it cannot be sorted directly
post_df['updated'].dtype
We order the entries by updated date (ascending order) and then group them by id. Notice that, as a previous step to ordering, the maximum of all the elements in each updated list is computed and used as the sort key.
grouped = post_df.sort_values('updated', key=np.vectorize(max)).groupby('id')
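To see what the sort key does, here is a small sketch on made-up data (the toy frame and its values are assumptions, with integers standing in for timestamps). Each updated cell is a list, np.vectorize(max) reduces every list to its largest element, and the rows are ordered by that value before grouping.

```python
import numpy as np
import pandas as pd

# toy data: integers stand in for timestamps
toy = pd.DataFrame({
    'id': ['a', 'b', 'a'],
    'updated': [[5], [1, 9], [2, 3]],
    'status': ['new', 'only', 'old'],
})

# the key receives the whole column and must return one sortable value per row
ordered = toy.sort_values('updated', key=np.vectorize(max))

# within each group rows keep the sorted order, so last() is the most recent entry
latest = ordered.groupby('id').last()
```

Here the per-row maxima are 5, 9 and 3, so the rows end up in the order 2, 0, 1 and the entry kept for id 'a' is the one with status 'new'.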
We are interested in groups with more than one element
not_one_element_group = (grouped.size() > 1)
not_one_element_group
id False False False False False
... False False False False False
Length: 115, dtype: bool
Number of tenders with updates
not_one_element_group.sum()
actual_groups = not_one_element_group.index[not_one_element_group]
actual_groups
Index(['', ''], dtype='string', name='id')
The largest group has this many elements
grouped.size().max()
For getting groups with exactly a certain number of elements (here \(2\) again)
size_2_groups = (grouped.size() == 2).loc[lambda x: x].index
size_2_groups
Index(['', ''], dtype='string', name='id')
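The two selections above (groups with more than one element, and groups with exactly two) can be reproduced on a toy frame. The data below is made up for illustration; the `.loc[lambda mask: mask]` idiom keeps only the entries of a boolean Series that are True, so its index is the set of matching group keys.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': ['a', 'b', 'a', 'c', 'c', 'c'],
    'value': range(6),
})

sizes = toy.groupby('id').size()          # a -> 2, b -> 1, c -> 3

# boolean indexing on the sizes themselves
more_than_one = sizes[sizes > 1].index

# same idea, written as in the text: filter the boolean mask by itself
exactly_two = (sizes == 2).loc[lambda mask: mask].index
```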
first_group = grouped.get_group(actual_groups[0])
first_group
id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
32 | | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 00:00:16.761000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
19 | | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 07:30:09.698000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 39 columns
Check which columns differ between the first and the last element of the group
columns_are_different = first_group.iloc[0] != first_group.iloc[-1]
columns_are_different
id False
summary False
title False
updated True
ContractFolderStatus - ContractFolderID False
ContractFolderStatus - ContractFolderStatusCode False
ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID True
ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name False
ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name False
ContractFolderStatus - ProcurementProject - Name False
ContractFolderStatus - ProcurementProject - TypeCode False
ContractFolderStatus - ProcurementProject - BudgetAmount - EstimatedOverallContractAmount False
ContractFolderStatus - ProcurementProject - BudgetAmount - TaxExclusiveAmount False
ContractFolderStatus - ProcurementProject - RequiredCommodityClassification - ItemClassificationCode False
ContractFolderStatus - ProcurementProject - RealizedLocation - CountrySubentityCode False
ContractFolderStatus - ProcurementProject - PlannedPeriod - DurationMeasure False
ContractFolderStatus - TenderResult - ResultCode True
ContractFolderStatus - TenderResult - ReceivedTenderQuantity True
ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID True
ContractFolderStatus - TenderResult - WinningParty - PartyName - Name True
ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount True
ContractFolderStatus - TenderingProcess - ProcedureCode False
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndDate True
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndTime True
ContractFolderStatus - ValidNoticeInfo - NoticeTypeCode False
ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - PublicationMediaName False
ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate True
ContractFolderStatus - LegalDocumentReference - ID True
ContractFolderStatus - LegalDocumentReference - Attachment - ExternalReference - URI True
ContractFolderStatus - TechnicalDocumentReference - ID True
ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI True
ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate True
ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate True
ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID True
ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name True
ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate True
ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime True
ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID True
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod True
dtype: bool
Apart from updated, the values in those columns are all NaN
updated | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - TenderResult - ResultCode | ContractFolderStatus - TenderResult - ReceivedTenderQuantity | ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID | ContractFolderStatus - TenderResult - WinningParty - PartyName - Name | ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndDate | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndTime | ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
32 | [2022-01-01 00:00:16.761000+00:00] | <NA> | NaN | NaN | NaN | NaN | NaN | <NA> | <NA> | [[2021-12-16, 2021-12-16, 2021-12-16, 2021-12-... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
19 | [2022-01-01 07:30:09.698000+00:00] | <NA> | NaN | NaN | NaN | NaN | NaN | <NA> | <NA> | [[2021-12-16, 2021-12-16, 2021-12-16, 2021-12-... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 22 columns
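Missing values never compare equal to each other, which is why a plain `!=` flags columns that are missing in both rows. The sketch below, on made-up numeric data, shows the effect and one possible way of discounting it; the variable names are hypothetical and this is not part of the library.

```python
import numpy as np
import pandas as pd

first = pd.Series({'amount': np.nan, 'lots': 2.0, 'budget': 10.0})
last = pd.Series({'amount': np.nan, 'lots': 2.0, 'budget': 12.0})

# NaN != NaN evaluates to True, so 'amount' is reported as different
naive = first != last

# treat "missing in both rows" as "no change"
both_missing = first.isna() & last.isna()
really_different = naive & ~both_missing
```

With this correction only budget is reported as changed.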
Only the last element of each group (the most recent entry) is kept
only_last_update_df = grouped.last()
only_last_update_df.head()
summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ContractFolderStatus - ProcurementProject - TypeCode | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
id | |||||||||||||||||||||
| Id licitación: 6011900114;Órgano de Contrataci... | Servicio de migración de los productos BMC ins... | [2021-12-31 10:56:15.855000+00:00] | 6011900114 | [RES] | | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Servicio de migración de los productos BMC ins... | 2.0 | ... | 1354764548025.pdf | | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
| Id licitación: 11/2019; Ó“rgano de Contratació... | Concurso de proyectos para la rehabilitación u... | [2021-12-31 10:41:20.201000+00:00] | 11/2019 | [EV] | <NA> | Alcaldia | Ayuntamiento de Leioa | Concurso de proyectos para la rehabilitación u... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | Ayuntamiento de Leioa | <NA> | <NA> | None | 2020-01-23 17:00:00+00:00 |
| Id licitación: 6012000208;Órgano de Contrataci... | Suministro de equipos de manutención (carretil... | [2021-12-31 11:11:15.851000+00:00] | 6012000208 | [RES] | | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Suministro de equipos de manutención (carretil... | 1.0 | ... | 1354836373815.pdf | | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
| Id licitación: 6012100119;Órgano de Contrataci... | Servicio de soporte al mantenimiento de instal... | [2021-12-31 10:11:16.027000+00:00] | 6012100119 | [RES] | | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Servicio de soporte al mantenimiento de instal... | 2.0 | ... | 1354874714036.pdf | | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
| Id Licitación: PcPG/2021/801758, Órgano de Con... | Servicio para la caracterización hidromorfolóx... | [2022-01-02 22:11:15.096000+00:00] | PcPG/2021/801758 | [RES] | | Ente Público Empresarial Augas de Galicia | <NA> | Servicio para la caracterización hidromorfolóx... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | A12024974 | <NA> | <NA> | <NA> | None | 2021-06-03 14:00:00+00:00 |
5 rows × 38 columns
The first group with more than one element
a_group_df = grouped.get_group(size_2_groups[0])
a_group_df
id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
32 | | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 00:00:16.761000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
19 | | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 07:30:09.698000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 39 columns
The columns whose history is to be kept
For the sake of efficiency
A list (of lists) with the columns that must be kept when keeping only updates
A function which keeps, for every procurement entry, only the last update.
keep_updates_only (df:pandas.core.frame.DataFrame)
Keep only the last update for every collection of entries with the same id
Type | Details | |
df | DataFrame | Input |
Returns | DataFrame | Output |
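The body of keep_updates_only is not shown here. As a rough sketch of the idea described above (keep the most recent row per id, while preserving the full history of a few selected columns as lists), something along these lines would work on a simplified frame. The function keep_last_with_history, its parameters and the toy data are all hypothetical, and unlike the real data the ordering column holds plain scalars.

```python
import pandas as pd


def keep_last_with_history(df, id_col, order_col, history_cols):
    """Keep the last row per id; history_cols are collected into lists instead."""
    ordered = df.sort_values(order_col)
    grouped = ordered.groupby(id_col, sort=False)
    # most recent value of every column that has no history
    latest = grouped.last().drop(columns=history_cols)
    # every value, oldest first, of the columns whose history matters
    history = grouped[history_cols].agg(list)
    return latest.join(history).reset_index()


toy = pd.DataFrame({
    'id': ['a', 'b', 'a'],
    'updated': [3, 1, 2],
    'status': ['RES', 'EV', 'PUB'],
    'title': ['second title', 'other', 'first title'],
})
result = keep_last_with_history(toy, 'id', 'updated', ['updated', 'status'])
row_a = result[result['id'] == 'a'].iloc[0]
```

In the result the history columns come last, which mirrors the layout of the output shown further down, where updated and the status code are the final two columns.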
only_updated_df = keep_updates_only(post_df)
only_updated_df.head()
id | index | summary | title | ContractFolderStatus - ContractFolderID | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ContractFolderStatus - ProcurementProject - TypeCode | ... | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | updated | ContractFolderStatus - ContractFolderStatusCode | |
0 | | 116 | Id licitación: 0001264/2021; Órgano de contrat... | 2021/pa-44-4 servicio de mantenimiento integra... | 0001264/2021 | <NA> | Agencia Pública Empresarial Sanitaria Bajo Gua... | Junta de Andalucía | 2021/pa-44-4 servicio de mantenimiento integra... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2021-11-11 15:00:00+00:00 | [2021-12-30 23:14:15.739000+00:00] | [RES] |
1 | | 115 | Id licitación: 8113_01 2021; Órgano de Contrat... | Contrato del Servicio de Teleasistencia para l... | 8113_01 2021 | <NA> | Agencia Navarra de Autonomía y Desarrollo de l... | Agencia Navarra para la Dependencia | Contrato del Servicio de Teleasistencia para l... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:12.604000+00:00] | [EV] |
2 | | 114 | Id licitación: 8113_3/2021; Órgano de Contrata... | Contrato de servicios de desinfección, desinse... | 8113_3/2021 | <NA> | Subdirector de Gestión y Recursos | Agencia Navarra para la Dependencia | Contrato de servicios de desinfección, desinse... | 2.0 | ... | 2022-01-01 | 2022-12-31 | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:13.594000+00:00] | [EV] |
3 | | 113 | Id licitación: 8165_3/2021; Órgano de Contrata... | Asistencia técnica para la prestación del serv... | 8165_3/2021 | <NA> | Mancomunidad de Servicios Sociales de Base de ... | MANCOMUNIDAD DE SERVICIOS DE HUARTE Y DE ESTER... | Asistencia técnica para la prestación del serv... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:14.393000+00:00] | [EV] |
4 | | 112 | Id licitación: 1005_391-2021; Órgano de Contra... | Apoyo a la gestión del patrimonio filmográfico... | 1005_391-2021 | <NA> | Dirección General de Cultura-Institución Prínc... | Departamento de Cultura, Deporte y Juventud | Apoyo a la gestión del patrimonio filmográfico... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-14 23:59:00+00:00 | [2021-12-31 00:00:14.946000+00:00] | [PUB] |
5 rows × 40 columns
RangeIndex(start=0, stop=115, step=1)
One of the groups (now an individual row) that had more than one row
only_updated_df[only_updated_df['id'] == size_2_groups[0]][assembled_historical_cols]
updated | ContractFolderStatus - ContractFolderStatusCode | |
95 | [2022-01-01 00:00:16.761000+00:00, 2022-01-01 ... | [EV, EV] |
Deleted series
import sproc.bundle
zip_file = directory / 'yearly' / ''
assert zip_file.exists()
deleted_series = sproc.bundle.read_deleted_zip(zip_file)
deleted_series.head(2)
zip file name id PlataformasAgregadasSinMenores_20180217_180137_1.atom 2018-01-04 13:11:18.021000+00:00 2018-01-04 13:11:17.921000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
No duplicates
deleted_series.index.get_level_values(1).duplicated().any()
We add an artificial one
new_data = pd.Timestamp(year=2017, month=1, day=3, tz=deleted_series.iloc[-1].tz)
deleted_series.loc['foo', deleted_series.index[-1][1], deleted_series.index[-1][2]] = new_data
Now we have a duplicate (at level 2, i.e., id)
deleted_series.index.get_level_values(2).duplicated().any()
A function to get rid of duplicates from a deleted series. The oldest entry is kept.
deduplicate_deleted_series (series:pandas.core.series.Series)
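One way such a deduplication could be written, assuming a Series of timestamps indexed by a three-level MultiIndex whose last level is named id, is sketched below. The function drop_duplicated_ids and the toy index values are hypothetical; the actual implementation of deduplicate_deleted_series may differ.

```python
import pandas as pd


def drop_duplicated_ids(series, level='id'):
    """Keep, for every value of the given index level, only the oldest entry."""
    ordered = series.sort_values()  # oldest timestamps first
    is_repeat = ordered.index.get_level_values(level).duplicated(keep='first')
    return ordered[~is_repeat]


index = pd.MultiIndex.from_tuples(
    [('z1', 'f1.atom', 'x'), ('z1', 'f2.atom', 'y'), ('foo', 'f2.atom', 'y')],
    names=['zip', 'file name', 'id'],
)
toy = pd.Series(
    pd.to_datetime(['2018-01-04', '2018-02-07', '2017-01-03'], utc=True),
    index=index,
    name='deleted_on',
)
deduplicated = drop_duplicated_ids(toy)
```

Sorting first and then keeping the first occurrence of each id is what makes the oldest entry survive.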
deduplicated_series = deduplicate_deleted_series(deleted_series)
deduplicated_series.tail(2)
zip file name id PlataformasAgregadasSinMenores_20180217_190110_1.atom 2018-02-07 09:01:16.378000+00:00
foo PlataformasAgregadasSinMenores_20180217_190110_1.atom 2017-01-03 00:00:00+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
assert not deduplicated_series.index.get_level_values(2).duplicated().any()
deleted_series.shape, deduplicated_series.shape
((52,), (51,))