import sproc.xml
postprocess
To avoid circular dependencies in the resulting Python modules, and since it is only needed for testing, sproc.xml is imported directly in this notebook
Sample data
Directory where the data (XML files) are stored
directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')
A (sample) file in that directory
xml_file = directory / 'PlataformasAgregadasSinMenores_20220104_030016_1.atom'
assert xml_file.exists()
xml_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_20220104_030016_1.atom')
df = sproc.xml.to_df(xml_file)
df
id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - LegalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: C. 2-2021; Órgano de Contrataci... | L'objecte del contracte és la renovació de tot... | 2022-01-03T01:11:41.826+01:00 | C. 2-2021 | ADJ | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Sant Ramon | Entitats municipals de Catalunya | L'objecte del contracte és la renovació de tot... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8128_3/2021; Órgano de Contrata... | Obras de restauración hidromorfológica del río... | 2022-01-03T01:00:11.194+01:00 | 8128_3/2021 | PUB | NaN | Pleno del Ayuntamiento | AYUNTAMIENTO DE MONREAL | Obras de restauración hidromorfológica del río... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1000_0005-CP01-2021-000063; Órg... | Contrato del servicio de realización de labore... | 2022-01-03T01:00:10.399+01:00 | 1000_0005-CP01-2021-000063 | EV | NaN | El Director General de Comunicación y Relacion... | Departamento de Presidencia, Igualdad, Función... | Contrato del servicio de realización de labore... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1379/2020 4738; Órgano de Contr... | Obres de renovació de l'enllumenat públic a la... | 2022-01-03T00:11:40.740+01:00 | 1379/2020 4738 | EV | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Canet de Mar | Entitats municipals de Catalunya | Obres de renovació de l'enllumenat públic a la... | ... | https://contractaciopublica.gencat.cat/ecofin_... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2021-44; Órgano de Contratación... | Subministre i la instal·lació fotovoltaica en ... | 2022-01-03T00:11:40.696+01:00 | 2021-44 | EV | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Valls | Entitats municipals de Catalunya | Subministre i la instal·lació fotovoltaica en ... | ... | https://contractaciopublica.gencat.cat/ecofin_... | Enllac plec clausules tecniques.doc | https://contractaciopublica.gencat.cat/ecofin_... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
112 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1005_391-2021; Órgano de Contra... | Apoyo a la gestión del patrimonio filmográfico... | 2021-12-31T01:00:14.946+01:00 | 1005_391-2021 | PUB | NaN | Dirección General de Cultura-Institución Prínc... | Departamento de Cultura, Deporte y Juventud | Apoyo a la gestión del patrimonio filmográfico... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
113 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8165_3/2021; Órgano de Contrata... | Asistencia técnica para la prestación del serv... | 2021-12-31T01:00:14.393+01:00 | 8165_3/2021 | EV | NaN | Mancomunidad de Servicios Sociales de Base de ... | MANCOMUNIDAD DE SERVICIOS DE HUARTE Y DE ESTER... | Asistencia técnica para la prestación del serv... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
114 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8113_3/2021; Órgano de Contrata... | Contrato de servicios de desinfección, desinse... | 2021-12-31T01:00:13.594+01:00 | 8113_3/2021 | EV | NaN | Subdirector de Gestión y Recursos | Agencia Navarra para la Dependencia | Contrato de servicios de desinfección, desinse... | ... | NaN | NaN | NaN | 2022-01-01 | 2022-12-31 | NaN | NaN | NaN | NaN | NaN |
115 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8113_01 2021; Órgano de Contrat... | Contrato del Servicio de Teleasistencia para l... | 2021-12-31T01:00:12.604+01:00 | 8113_01 2021 | EV | NaN | Agencia Navarra de Autonomía y Desarrollo de l... | Agencia Navarra para la Dependencia | Contrato del Servicio de Teleasistencia para l... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
116 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 0001264/2021; Órgano de contrat... | 2021/pa-44-4 servicio de mantenimiento integra... | 2021-12-31T00:14:15.739+01:00 | 0001264/2021 | RES | NaN | Agencia Pública Empresarial Sanitaria Bajo Gua... | Junta de Andalucía | 2021/pa-44-4 servicio de mantenimiento integra... | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
117 rows × 38 columns
There are some multivalued columns
sproc.structure.multivalued_columns(df)
['ContractFolderStatus - ProcurementProject - RequiredCommodityClassification - ItemClassificationCode',
'ContractFolderStatus - TenderResult - ResultCode',
'ContractFolderStatus - TenderResult - ReceivedTenderQuantity',
'ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID',
'ContractFolderStatus - TenderResult - WinningParty - PartyName - Name',
'ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount',
'ContractFolderStatus - ValidNoticeInfo - NoticeTypeCode',
'ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - PublicationMediaName',
'ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate',
'ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID']
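As a rough illustration of what "multivalued" means here, the sketch below flags the columns in which at least one cell holds a list. The helper `find_multivalued` and the toy data are hypothetical; this is not the implementation of sproc.structure.multivalued_columns, only a guess at the idea behind it.

```python
import pandas as pd


def find_multivalued(df: pd.DataFrame) -> list:
    """Names of the columns in which at least one cell holds a list (hypothetical helper)."""
    return [
        name for name in df.columns
        if df[name].map(lambda value: isinstance(value, list)).any()
    ]


# toy data: 'codes' mixes a list with a scalar, 'title' is single-valued
toy = pd.DataFrame({'title': ['a', 'b'], 'codes': [['1', '2'], '3']})
found = find_multivalued(toy)
print(found)
```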
Columns’ types
A regular expression matching any column whose final component is PostalZone
For instance
re_postal_zone.match('foo - PostalZone')
<re.Match object; span=(0, 16), match='foo - PostalZone'>
but
re_postal_zone.match('Address - Number')
Two positives
assert re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'ProcurementProject', 'RealizedLocation', 'Address', 'PostalZone']))
assert re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'LocatedContractingParty', 'Party', 'PostalAddress', 'PostalZone']))
A negative
assert not re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'Wap']))
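The definition of re_postal_zone is not shown on this page. A minimal sketch of a pattern with the same observable behaviour is given below, assuming that the components of a column name are joined with ' - ' (as the column names above suggest). The names `SEPARATOR` and `re_postal_zone_sketch` are made up for the example.

```python
import re

# assumed separator between the components of a column name
SEPARATOR = ' - '

# optional prefix ending in the separator, then PostalZone as the final component
re_postal_zone_sketch = re.compile(rf'(.*{re.escape(SEPARATOR)})?PostalZone$')

positive = re_postal_zone_sketch.match('foo - PostalZone')
negative = re_postal_zone_sketch.match('Address - Number')
print(positive, negative)
```

Anchoring the pattern with `$` is what restricts the match to the final component; a name such as 'PostalZone - Number' would not match.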
The list below specifies fields that are to be interpreted as str
str_columns[:2]
[]
For easier processing, the lists are turned into str
assembled_str_columns
[]
These columns, if present, are parsed jointly into a new column
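Judging from the output further down, where a ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod column appears, the joint parsing presumably combines a date column and a time column into a single timestamp. The sketch below shows one way this could be done; the short column names and the choice of UTC are assumptions, not taken from typecast_columns.

```python
import pandas as pd

# toy data: the second row has neither date nor time
toy = pd.DataFrame({
    'EndDate': ['2021-12-17', None],
    'EndTime': ['14:00:00', None],
})

# join date and time as text; the result is missing if either part is missing
joined = toy['EndDate'].str.cat(toy['EndTime'], sep=' ')

# parse into timezone-aware timestamps (UTC is an assumption); missing -> NaT
toy['Deadline'] = pd.to_datetime(joined, utc=True)
print(toy['Deadline'])
```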
The column indicating the status is also going to be exploited
A function to tidy up some things in the pd.DataFrames returned by sproc.xml.to_df
typecast_columns
typecast_columns (input_df:pandas.core.frame.DataFrame)
Tidy up the pd.DataFrame returned by to_df
Type | Details | |
---|---|---|
input_df | DataFrame | Input DataFrame as read by to_df |
Returns | DataFrame | Post-processed DataFrame |
post_df = typecast_columns(df)
post_df.head()
id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: C. 2-2021; Órgano de Contrataci... | L'objecte del contracte és la renovació de tot... | [2022-01-03 00:11:41.826000+00:00] | C. 2-2021 | [ADJ] | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Sant Ramon | Entitats municipals de Catalunya | L'objecte del contracte és la renovació de tot... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2021-12-17 14:00:00+00:00 |
1 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8128_3/2021; Órgano de Contrata... | Obras de restauración hidromorfológica del río... | [2022-01-03 00:00:11.194000+00:00] | 8128_3/2021 | [PUB] | <NA> | Pleno del Ayuntamiento | AYUNTAMIENTO DE MONREAL | Obras de restauración hidromorfológica del río... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-22 23:30:00+00:00 |
2 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1000_0005-CP01-2021-000063; Órg... | Contrato del servicio de realización de labore... | [2022-01-03 00:00:10.399000+00:00] | 1000_0005-CP01-2021-000063 | [EV] | <NA> | El Director General de Comunicación y Relacion... | Departamento de Presidencia, Igualdad, Función... | Contrato del servicio de realización de labore... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
3 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1379/2020 4738; Órgano de Contr... | Obres de renovació de l'enllumenat públic a la... | [2022-01-02 23:11:40.740000+00:00] | 1379/2020 4738 | [EV] | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Canet de Mar | Entitats municipals de Catalunya | Obres de renovació de l'enllumenat públic a la... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
4 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2021-44; Órgano de Contratación... | Subministre i la instal·lació fotovoltaica en ... | [2022-01-02 23:11:40.696000+00:00] | 2021-44 | [EV] | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Valls | Entitats municipals de Catalunya | Subministre i la instal·lació fotovoltaica en ... | ... | Enllac plec clausules tecniques.doc | https://contractaciopublica.gencat.cat/ecofin_... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
5 rows × 39 columns
After post-processing, some object columns have been parsed as dates.
Multivalued columns are still multivalued, but there are now two extra ones (updated and ContractFolderStatus - ContractFolderStatusCode)
# assert sproc.structure.multivalued_columns(post_df) == sproc.structure.multivalued_columns(df) # <---------------------
assert len(sproc.structure.multivalued_columns(post_df)) == len(sproc.structure.multivalued_columns(df)) + 2
Most recent update
After post-processing, updated is of type object, which is not apt for ordering
post_df['updated'].dtype
dtype('O')
We order the entries by updated date (ascending order) and then group by id. Notice that, as a previous step to ordering, the maximum of all the elements in updated is computed.
grouped = post_df.sort_values('updated', key=np.vectorize(max)).groupby('id')
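The effect of sorting on the maximum of a list-valued column before grouping can be seen on a toy example. Everything below (data, the `ts` helper, variable names) is made up, and `Series.map(max)` is used as the sort key in place of np.vectorize(max), purely for illustration.

```python
import pandas as pd


def ts(text):
    """Shorthand for a UTC timestamp (helper for this example only)."""
    return pd.Timestamp(text, tz='UTC')


toy = pd.DataFrame({
    'id': ['a', 'b', 'a'],
    'updated': [[ts('2022-01-01 07:30')], [ts('2021-12-31')], [ts('2022-01-01 00:00')]],
    'title': ['a, second version', 'b', 'a, first version'],
})

# each cell of 'updated' is a list, so rows are ordered by the latest timestamp in it
ordered = toy.sort_values('updated', key=lambda column: column.map(max))

# within each group the rows keep the order established above (oldest first)
toy_grouped = ordered.groupby('id')
newest_title = toy_grouped['title'].last()
print(newest_title)
```

Because groupby preserves the row order inside each group, the last row of every group is its most recent update.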
We are interested in groups with more than one element
not_one_element_group = (grouped.size() > 1)
not_one_element_group
id
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
...
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
https://contrataciondelestado.es/sindicacion/P... False
Length: 115, dtype: bool
Number of tenders with updates
not_one_element_group.sum()
2
actual_groups = not_one_element_group.index[not_one_element_group]
actual_groups
Index(['https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8904280', 'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8994209'], dtype='string', name='id')
The largest group has this number of elements
grouped.size().max()
2
To get the groups with exactly a certain number of elements (here \(2\) again)
size_2_groups = (grouped.size() == 2).loc[lambda x: x].index
size_2_groups
Index(['https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8904280', 'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8994209'], dtype='string', name='id')
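The two ways of selecting group labels used above (indexing the index with a boolean mask, and filtering the mask by itself through a callable passed to .loc) give the same result. A small sketch on made-up group sizes:

```python
import pandas as pd

# hypothetical group sizes, indexed by group label
sizes = pd.Series([1, 2, 1, 2], index=pd.Index(['a', 'b', 'c', 'd'], name='id'))

mask = sizes == 2

# the callable receives the mask itself and returns it, keeping only the True entries
via_callable = mask.loc[lambda x: x].index

# equivalent: index the labels directly with the boolean mask
via_mask = mask.index[mask]
print(list(via_callable), list(via_mask))
```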
first_group = grouped.get_group(actual_groups[0])
first_group
id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 00:00:16.761000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
19 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 07:30:09.698000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 39 columns
True if columns are different across elements
columns_are_different = first_group.iloc[0] != first_group.iloc[-1]
columns_are_different
id False
summary False
title False
updated True
ContractFolderStatus - ContractFolderID False
ContractFolderStatus - ContractFolderStatusCode False
ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID True
ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name False
ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name False
ContractFolderStatus - ProcurementProject - Name False
ContractFolderStatus - ProcurementProject - TypeCode False
ContractFolderStatus - ProcurementProject - BudgetAmount - EstimatedOverallContractAmount False
ContractFolderStatus - ProcurementProject - BudgetAmount - TaxExclusiveAmount False
ContractFolderStatus - ProcurementProject - RequiredCommodityClassification - ItemClassificationCode False
ContractFolderStatus - ProcurementProject - RealizedLocation - CountrySubentityCode False
ContractFolderStatus - ProcurementProject - PlannedPeriod - DurationMeasure False
ContractFolderStatus - TenderResult - ResultCode True
ContractFolderStatus - TenderResult - ReceivedTenderQuantity True
ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID True
ContractFolderStatus - TenderResult - WinningParty - PartyName - Name True
ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount True
ContractFolderStatus - TenderingProcess - ProcedureCode False
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndDate True
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndTime True
ContractFolderStatus - ValidNoticeInfo - NoticeTypeCode False
ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - PublicationMediaName False
ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate True
ContractFolderStatus - LegalDocumentReference - ID True
ContractFolderStatus - LegalDocumentReference - Attachment - ExternalReference - URI True
ContractFolderStatus - TechnicalDocumentReference - ID True
ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI True
ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate True
ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate True
ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID True
ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name True
ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate True
ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime True
ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID True
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod True
dtype: bool
The values in those columns are (besides updated) all NaNs
first_group[first_group.columns[columns_are_different]]
updated | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - TenderResult - ResultCode | ContractFolderStatus - TenderResult - ReceivedTenderQuantity | ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID | ContractFolderStatus - TenderResult - WinningParty - PartyName - Name | ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndDate | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndTime | ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | [2022-01-01 00:00:16.761000+00:00] | <NA> | NaN | NaN | NaN | NaN | NaN | <NA> | <NA> | [[2021-12-16, 2021-12-16, 2021-12-16, 2021-12-... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
19 | [2022-01-01 07:30:09.698000+00:00] | <NA> | NaN | NaN | NaN | NaN | NaN | <NA> | <NA> | [[2021-12-16, 2021-12-16, 2021-12-16, 2021-12-... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 22 columns
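Missing values are reported as different because NaN never compares equal to itself. The toy sketch below (made-up float data, not the real rows) shows the effect, together with one possible way of ignoring positions where both sides are missing.

```python
import numpy as np
import pandas as pd

first = pd.Series({'budget': 100.0, 'awarded': np.nan, 'lots': 2.0})
last = pd.Series({'budget': 100.0, 'awarded': np.nan, 'lots': 3.0})

# NaN != NaN evaluates to True, so 'awarded' is flagged even though nothing changed
differs = first != last

# discard the positions where both values are missing
really_differs = differs & ~(first.isna() & last.isna())
print(differs.tolist(), really_differs.tolist())
```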
Only the last element of each group (the most recent entry) is kept
only_last_update_df = grouped.last()
only_last_update_df.head()
summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ContractFolderStatus - ProcurementProject - TypeCode | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
https://contrataciondelestado.es/sindicacion/P... | Id licitación: 6011900114;Órgano de Contrataci... | Servicio de migración de los productos BMC ins... | [2021-12-31 10:56:15.855000+00:00] | 6011900114 | [RES] | http://www.madrid.org/cs/Satellite?cid=1204201... | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Servicio de migración de los productos BMC ins... | 2.0 | ... | 1354764548025.pdf | http://www.madrid.org/contratos-publicos/13547... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
https://contrataciondelestado.es/sindicacion/P... | Id licitación: 11/2019; Ó“rgano de Contratació... | Concurso de proyectos para la rehabilitación u... | [2021-12-31 10:41:20.201000+00:00] | 11/2019 | [EV] | <NA> | Alcaldia | Ayuntamiento de Leioa | Concurso de proyectos para la rehabilitación u... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | Ayuntamiento de Leioa | <NA> | <NA> | None | 2020-01-23 17:00:00+00:00 |
https://contrataciondelestado.es/sindicacion/P... | Id licitación: 6012000208;Órgano de Contrataci... | Suministro de equipos de manutención (carretil... | [2021-12-31 11:11:15.851000+00:00] | 6012000208 | [RES] | http://www.madrid.org/cs/Satellite?cid=1204201... | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Suministro de equipos de manutención (carretil... | 1.0 | ... | 1354836373815.pdf | http://www.madrid.org/contratos-publicos/13548... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
https://contrataciondelestado.es/sindicacion/P... | Id licitación: 6012100119;Órgano de Contrataci... | Servicio de soporte al mantenimiento de instal... | [2021-12-31 10:11:16.027000+00:00] | 6012100119 | [RES] | http://www.madrid.org/cs/Satellite?cid=1204201... | Empresa Pública de Metro de Madrid, S.A. | Consejería de Transportes e Infraestructuras | Servicio de soporte al mantenimiento de instal... | 2.0 | ... | 1354874714036.pdf | http://www.madrid.org/contratos-publicos/13548... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | None | NaT |
https://contrataciondelestado.es/sindicacion/P... | Id Licitación: PcPG/2021/801758, Órgano de Con... | Servicio para la caracterización hidromorfolóx... | [2022-01-02 22:11:15.096000+00:00] | PcPG/2021/801758 | [RES] | https://www.contratosdegalicia.gal//consultaOr... | Ente Público Empresarial Augas de Galicia | <NA> | Servicio para la caracterización hidromorfolóx... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | A12024974 | <NA> | <NA> | <NA> | None | 2021-06-03 14:00:00+00:00 |
5 rows × 38 columns
The first group with more than one element
a_group_df = grouped.get_group(size_2_groups[0])
a_group_df
id | summary | title | updated | ContractFolderStatus - ContractFolderID | ContractFolderStatus - ContractFolderStatusCode | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ... | ContractFolderStatus - TechnicalDocumentReference - ID | ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 00:00:16.761000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
19 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 3069_1/2021; Órgano de Contrata... | Construcción de depósito regulador de 100m3 en... | [2022-01-01 07:30:09.698000+00:00] | 3069_1/2021 | [EV] | <NA> | CONCEJO DE GALBARRA | Concejo de Galbarra | Construcción de depósito regulador de 100m3 en... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
2 rows × 39 columns
The columns whose history is to be kept
For the sake of efficiency
A list (of lists) with the columns that must be kept when keeping only updates
A function which keeps, for every procurement entry, only the last update.
keep_updates_only
keep_updates_only (df:pandas.core.frame.DataFrame)
Keep only the last update for every collection of entries with the same id
Type | Details | |
---|---|---|
df | DataFrame | Input |
Returns | DataFrame | Output |
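The body of keep_updates_only is not shown on this page. The sketch below is only a guess at the general idea, based on the output that follows: one row per id, with the historical columns accumulating the values of every update. The function name `keep_last_with_history`, its parameters and the toy data are all hypothetical.

```python
import pandas as pd


def keep_last_with_history(df, historical_cols, by='id', order_col='updated'):
    """One row per `by` value; historical columns collect the values of all updates."""
    # oldest first, ordering list-valued timestamps by their latest element
    ordered = df.sort_values(order_col, key=lambda column: column.map(max))
    # non-historical columns: last non-null value in each group
    latest = ordered.drop(columns=historical_cols).groupby(by, sort=False).last()
    # historical columns: concatenate the lists of all the rows in the group
    grouped = ordered.groupby(by, sort=False)
    history = pd.DataFrame({
        col: grouped[col].agg(lambda lists: [v for values in lists for v in values])
        for col in historical_cols
    })
    return latest.join(history).reset_index()


def ts(text):
    return pd.Timestamp(text, tz='UTC')


toy = pd.DataFrame({
    'id': ['a', 'b', 'a'],
    'title': ['a v2', 'b v1', 'a v1'],
    'updated': [[ts('2022-01-01 07:30')], [ts('2021-12-31')], [ts('2022-01-01 00:00')]],
    'status': [['EV'], ['PUB'], ['EV']],
})
result = keep_last_with_history(toy, ['updated', 'status'])
print(result)
```

Note that GroupBy.last returns the last non-null value of each column rather than the last row as a whole, so a value missing in the latest update is filled in from an earlier one.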
only_updated_df = keep_updates_only(post_df)
only_updated_df.head()
id | index | summary | title | ContractFolderStatus - ContractFolderID | ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID | ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - ProcurementProject - Name | ContractFolderStatus - ProcurementProject - TypeCode | ... | ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate | ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate | ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID | ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate | ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime | ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID | ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod | updated | ContractFolderStatus - ContractFolderStatusCode | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://contrataciondelestado.es/sindicacion/P... | 116 | Id licitación: 0001264/2021; Órgano de contrat... | 2021/pa-44-4 servicio de mantenimiento integra... | 0001264/2021 | <NA> | Agencia Pública Empresarial Sanitaria Bajo Gua... | Junta de Andalucía | 2021/pa-44-4 servicio de mantenimiento integra... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2021-11-11 15:00:00+00:00 | [2021-12-30 23:14:15.739000+00:00] | [RES] |
1 | https://contrataciondelestado.es/sindicacion/P... | 115 | Id licitación: 8113_01 2021; Órgano de Contrat... | Contrato del Servicio de Teleasistencia para l... | 8113_01 2021 | <NA> | Agencia Navarra de Autonomía y Desarrollo de l... | Agencia Navarra para la Dependencia | Contrato del Servicio de Teleasistencia para l... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:12.604000+00:00] | [EV] |
2 | https://contrataciondelestado.es/sindicacion/P... | 114 | Id licitación: 8113_3/2021; Órgano de Contrata... | Contrato de servicios de desinfección, desinse... | 8113_3/2021 | <NA> | Subdirector de Gestión y Recursos | Agencia Navarra para la Dependencia | Contrato de servicios de desinfección, desinse... | 2.0 | ... | 2022-01-01 | 2022-12-31 | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:13.594000+00:00] | [EV] |
3 | https://contrataciondelestado.es/sindicacion/P... | 113 | Id licitación: 8165_3/2021; Órgano de Contrata... | Asistencia técnica para la prestación del serv... | 8165_3/2021 | <NA> | Mancomunidad de Servicios Sociales de Base de ... | MANCOMUNIDAD DE SERVICIOS DE HUARTE Y DE ESTER... | Asistencia técnica para la prestación del serv... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT | [2021-12-31 00:00:14.393000+00:00] | [EV] |
4 | https://contrataciondelestado.es/sindicacion/P... | 112 | Id licitación: 1005_391-2021; Órgano de Contra... | Apoyo a la gestión del patrimonio filmográfico... | 1005_391-2021 | <NA> | Dirección General de Cultura-Institución Prínc... | Departamento de Cultura, Deporte y Juventud | Apoyo a la gestión del patrimonio filmográfico... | 2.0 | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-14 23:59:00+00:00 | [2021-12-31 00:00:14.946000+00:00] | [PUB] |
5 rows × 40 columns
len(only_updated_df)
115
only_updated_df.index
RangeIndex(start=0, stop=115, step=1)
One of the groups (now an individual row) that had more than one row
only_updated_df[only_updated_df['id'] == size_2_groups[0]][assembled_historical_cols]
updated | ContractFolderStatus - ContractFolderStatusCode | |
---|---|---|
95 | [2022-01-01 00:00:16.761000+00:00, 2022-01-01 ... | [EV, EV] |
Deleted series
import sproc.bundle
zip_file = directory / 'yearly' / 'PlataformasAgregadasSinMenores_2018.zip'
assert zip_file.exists()
zip_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/yearly/PlataformasAgregadasSinMenores_2018.zip')
deleted_series = sproc.bundle.read_deleted_zip(zip_file)
deleted_series.head(2)
zip file name id
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_180137_1.atom https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1985903 2018-01-04 13:11:18.021000+00:00
https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1969197 2018-01-04 13:11:17.921000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
Checking for duplicates at level 1 (file name)
deleted_series.index.get_level_values(1).duplicated().any()
True
We add an artificial one
new_data = pd.Timestamp(year=2017, month=1, day=3, tz=deleted_series.iloc[-1].tz)
deleted_series.loc['foo', deleted_series.index[-1][1], deleted_series.index[-1][2]] = new_data
Now we have a duplicate (at level 2, i.e., id)
deleted_series.index.get_level_values(2).duplicated().any()
True
A function to get rid of duplicates from a deleted series. The oldest entry is kept.
deduplicate_deleted_series
deduplicate_deleted_series (series:pandas.core.series.Series)
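A minimal sketch of the "keep the oldest entry" idea, assuming the index has a level named id as in the output above. The helper `dedup_keep_oldest` and the toy series are hypothetical, and the real function may order its result differently (this one returns the entries sorted by date).

```python
import pandas as pd


def dedup_keep_oldest(series: pd.Series, level: str = 'id') -> pd.Series:
    """For every value of `level`, keep only the entry with the earliest timestamp."""
    # oldest first, so the first occurrence of every id is the one to keep
    ordered = series.sort_values(kind='stable')
    repeated = ordered.index.get_level_values(level).duplicated(keep='first')
    return ordered[~repeated]


# toy series: id 'x' shows up twice, the second time with an older date
index = pd.MultiIndex.from_tuples(
    [('2018.zip', 'f1.atom', 'x'), ('2018.zip', 'f2.atom', 'y'), ('foo', 'f2.atom', 'x')],
    names=['zip', 'file name', 'id'],
)
toy = pd.Series(
    pd.to_datetime(['2018-01-04', '2018-02-07', '2017-01-03'], utc=True),
    index=index, name='deleted_on',
)
deduplicated = dedup_keep_oldest(toy)
print(deduplicated)
```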
deduplicated_series = deduplicate_deleted_series(deleted_series)
deduplicated_series.tail(2)
zip file name id
PlataformasAgregadasSinMenores_2018.zip PlataformasAgregadasSinMenores_20180217_190110_1.atom https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1996094 2018-02-07 09:01:16.378000+00:00
foo PlataformasAgregadasSinMenores_20180217_190110_1.atom https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/2000163 2017-01-03 00:00:00+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
assert not deduplicated_series.index.get_level_values(2).duplicated().any()
deleted_series.shape, deduplicated_series.shape
((52,), (51,))