postprocess

Functionality to post-process already parsed data

The import below is only needed for testing; it is kept out of the exported code to avoid circular dependencies in the resulting Python modules

import sproc.xml

Sample data

Directory where the data (XML files) are stored

directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')

A (sample) file in that directory

xml_file = directory / 'PlataformasAgregadasSinMenores_20220104_030016_1.atom'
assert xml_file.exists()
xml_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_20220104_030016_1.atom')
df = sproc.xml.to_df(xml_file)
df
id summary title updated ContractFolderStatus - ContractFolderID ContractFolderStatus - ContractFolderStatusCode ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - ProcurementProject - Name ... ContractFolderStatus - LegalDocumentReference - Attachment - ExternalReference - URI ContractFolderStatus - TechnicalDocumentReference - ID ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID
0 https://contrataciondelestado.es/sindicacion/P... Id licitación: C. 2-2021; Órgano de Contrataci... L'objecte del contracte és la renovació de tot... 2022-01-03T01:11:41.826+01:00 C. 2-2021 ADJ https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Sant Ramon Entitats municipals de Catalunya L'objecte del contracte és la renovació de tot... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 https://contrataciondelestado.es/sindicacion/P... Id licitación: 8128_3/2021; Órgano de Contrata... Obras de restauración hidromorfológica del río... 2022-01-03T01:00:11.194+01:00 8128_3/2021 PUB NaN Pleno del Ayuntamiento AYUNTAMIENTO DE MONREAL Obras de restauración hidromorfológica del río... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 https://contrataciondelestado.es/sindicacion/P... Id licitación: 1000_0005-CP01-2021-000063; Órg... Contrato del servicio de realización de labore... 2022-01-03T01:00:10.399+01:00 1000_0005-CP01-2021-000063 EV NaN El Director General de Comunicación y Relacion... Departamento de Presidencia, Igualdad, Función... Contrato del servicio de realización de labore... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 https://contrataciondelestado.es/sindicacion/P... Id licitación: 1379/2020 4738; Órgano de Contr... Obres de renovació de l'enllumenat públic a la... 2022-01-03T00:11:40.740+01:00 1379/2020 4738 EV https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Canet de Mar Entitats municipals de Catalunya Obres de renovació de l'enllumenat públic a la... ... https://contractaciopublica.gencat.cat/ecofin_... NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 https://contrataciondelestado.es/sindicacion/P... Id licitación: 2021-44; Órgano de Contratación... Subministre i la instal·lació fotovoltaica en ... 2022-01-03T00:11:40.696+01:00 2021-44 EV https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Valls Entitats municipals de Catalunya Subministre i la instal·lació fotovoltaica en ... ... https://contractaciopublica.gencat.cat/ecofin_... Enllac plec clausules tecniques.doc https://contractaciopublica.gencat.cat/ecofin_... NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
112 https://contrataciondelestado.es/sindicacion/P... Id licitación: 1005_391-2021; Órgano de Contra... Apoyo a la gestión del patrimonio filmográfico... 2021-12-31T01:00:14.946+01:00 1005_391-2021 PUB NaN Dirección General de Cultura-Institución Prínc... Departamento de Cultura, Deporte y Juventud Apoyo a la gestión del patrimonio filmográfico... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
113 https://contrataciondelestado.es/sindicacion/P... Id licitación: 8165_3/2021; Órgano de Contrata... Asistencia técnica para la prestación del serv... 2021-12-31T01:00:14.393+01:00 8165_3/2021 EV NaN Mancomunidad de Servicios Sociales de Base de ... MANCOMUNIDAD DE SERVICIOS DE HUARTE Y DE ESTER... Asistencia técnica para la prestación del serv... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
114 https://contrataciondelestado.es/sindicacion/P... Id licitación: 8113_3/2021; Órgano de Contrata... Contrato de servicios de desinfección, desinse... 2021-12-31T01:00:13.594+01:00 8113_3/2021 EV NaN Subdirector de Gestión y Recursos Agencia Navarra para la Dependencia Contrato de servicios de desinfección, desinse... ... NaN NaN NaN 2022-01-01 2022-12-31 NaN NaN NaN NaN NaN
115 https://contrataciondelestado.es/sindicacion/P... Id licitación: 8113_01 2021; Órgano de Contrat... Contrato del Servicio de Teleasistencia para l... 2021-12-31T01:00:12.604+01:00 8113_01 2021 EV NaN Agencia Navarra de Autonomía y Desarrollo de l... Agencia Navarra para la Dependencia Contrato del Servicio de Teleasistencia para l... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
116 https://contrataciondelestado.es/sindicacion/P... Id licitación: 0001264/2021; Órgano de contrat... 2021/pa-44-4 servicio de mantenimiento integra... 2021-12-31T00:14:15.739+01:00 0001264/2021 RES NaN Agencia Pública Empresarial Sanitaria Bajo Gua... Junta de Andalucía 2021/pa-44-4 servicio de mantenimiento integra... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

117 rows × 38 columns

There are some multivalued columns

sproc.structure.multivalued_columns(df)
['ContractFolderStatus - ProcurementProject - RequiredCommodityClassification - ItemClassificationCode',
 'ContractFolderStatus - TenderResult - ResultCode',
 'ContractFolderStatus - TenderResult - ReceivedTenderQuantity',
 'ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID',
 'ContractFolderStatus - TenderResult - WinningParty - PartyName - Name',
 'ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount',
 'ContractFolderStatus - ValidNoticeInfo - NoticeTypeCode',
 'ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - PublicationMediaName',
 'ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate',
 'ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID']
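The implementation of sproc.structure.multivalued_columns is not shown on this page. As a rough illustration of the idea only, the sketch below flags columns in which at least one cell holds a list; the helper name list_valued_columns and the toy data are hypothetical, and the real function may use a different criterion.

```python
import pandas as pd


def list_valued_columns(df: pd.DataFrame) -> list:
    """Return the names of the columns with at least one list-valued cell."""
    return [
        column
        for column in df.columns
        if df[column].map(lambda value: isinstance(value, list)).any()
    ]


# toy data: 'multi' mixes a list with a plain string
toy_df = pd.DataFrame({'single': ['a', 'b'], 'multi': [['x', 'y'], 'z']})
found = list_valued_columns(toy_df)
```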

Columns’ types

A regular expression matching any column whose final component is PostalZone

For instance

re_postal_zone.match('foo - PostalZone')
<re.Match object; span=(0, 16), match='foo - PostalZone'>

but

re_postal_zone.match('Address - Number')

Two positives

assert re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'ProcurementProject', 'RealizedLocation', 'Address', 'PostalZone']))
assert re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'LocatedContractingParty', 'Party', 'PostalAddress', 'PostalZone']))

A negative

assert not re_postal_zone.match(sproc.structure.assemble_name(['ContractFolderStatus', 'Wap']))
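The definition of re_postal_zone is not displayed above. A minimal sketch of a pattern with the behaviour described, assuming that name components are joined with ' - ' (as the column names suggest), could look as follows; re_postal_zone_sketch and assemble_name_sketch are hypothetical stand-ins, not the library's own objects.

```python
import re

# any prefix, then the separator, then PostalZone as the final component
re_postal_zone_sketch = re.compile(r'.* - PostalZone$')


def assemble_name_sketch(components):
    """Join the components of a column name (assumed separator: ' - ')."""
    return ' - '.join(components)


positive = re_postal_zone_sketch.match(
    assemble_name_sketch(['ContractFolderStatus', 'Party', 'PostalAddress', 'PostalZone']))
negative = re_postal_zone_sketch.match('Address - Number')
```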

The list below specifies fields that are to be interpreted as str

str_columns[:2]
[]

For easier processing, the lists are turned into str

assembled_str_columns
[]

These columns, if present, are parsed jointly into a new column
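The code that performs this joint parsing is not shown here. One possible way of merging a date column and a time column into a single timezone-aware column is sketched below, assuming both hold strings; the short column names EndDate, EndTime and Deadline are placeholders for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'EndDate': ['2021-12-17', None],
    'EndTime': ['14:00:00', None],
})

# concatenate date and time; a missing part yields a missing result
date_and_time = toy_df['EndDate'].str.cat(toy_df['EndTime'], sep=' ')

# parse into UTC timestamps; missing values become NaT
toy_df['Deadline'] = pd.to_datetime(date_and_time, utc=True)
```

Rows lacking either part end up as NaT, which matches the NaT entries visible in the post-processed output further down.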

The column indicating the status will also be used

A function to tidy up some things in the pd.DataFrames returned by sproc.xml.to_df



typecast_columns

 typecast_columns (input_df:pandas.core.frame.DataFrame)

Tidy up the pd.DataFrame returned by to_df

Type Details
input_df DataFrame Input DataFrame as read by to_df
Returns DataFrame Post-processed DataFrame
post_df = typecast_columns(df)
post_df.head()
id summary title updated ContractFolderStatus - ContractFolderID ContractFolderStatus - ContractFolderStatusCode ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - ProcurementProject - Name ... ContractFolderStatus - TechnicalDocumentReference - ID ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod
0 https://contrataciondelestado.es/sindicacion/P... Id licitación: C. 2-2021; Órgano de Contrataci... L'objecte del contracte és la renovació de tot... [2022-01-03 00:11:41.826000+00:00] C. 2-2021 [ADJ] https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Sant Ramon Entitats municipals de Catalunya L'objecte del contracte és la renovació de tot... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 2021-12-17 14:00:00+00:00
1 https://contrataciondelestado.es/sindicacion/P... Id licitación: 8128_3/2021; Órgano de Contrata... Obras de restauración hidromorfológica del río... [2022-01-03 00:00:11.194000+00:00] 8128_3/2021 [PUB] <NA> Pleno del Ayuntamiento AYUNTAMIENTO DE MONREAL Obras de restauración hidromorfológica del río... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 2022-01-22 23:30:00+00:00
2 https://contrataciondelestado.es/sindicacion/P... Id licitación: 1000_0005-CP01-2021-000063; Órg... Contrato del servicio de realización de labore... [2022-01-03 00:00:10.399000+00:00] 1000_0005-CP01-2021-000063 [EV] <NA> El Director General de Comunicación y Relacion... Departamento de Presidencia, Igualdad, Función... Contrato del servicio de realización de labore... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT
3 https://contrataciondelestado.es/sindicacion/P... Id licitación: 1379/2020 4738; Órgano de Contr... Obres de renovació de l'enllumenat públic a la... [2022-01-02 23:11:40.740000+00:00] 1379/2020 4738 [EV] https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Canet de Mar Entitats municipals de Catalunya Obres de renovació de l'enllumenat públic a la... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 2022-01-02 23:59:00+00:00
4 https://contrataciondelestado.es/sindicacion/P... Id licitación: 2021-44; Órgano de Contratación... Subministre i la instal·lació fotovoltaica en ... [2022-01-02 23:11:40.696000+00:00] 2021-44 [EV] https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Valls Entitats municipals de Catalunya Subministre i la instal·lació fotovoltaica en ... ... Enllac plec clausules tecniques.doc https://contractaciopublica.gencat.cat/ecofin_... <NA> <NA> <NA> <NA> <NA> <NA> NaN 2022-01-02 23:59:00+00:00

5 rows × 39 columns

After post-processing some objects have been parsed as dates.

Multivalued columns are still multivalued, but there are now a couple of extra ones (updated and ContractFolderStatus - ContractFolderStatusCode)

# assert sproc.structure.multivalued_columns(post_df) == sproc.structure.multivalued_columns(df) # <---------------------
assert len(sproc.structure.multivalued_columns(post_df)) == len(sproc.structure.multivalued_columns(df)) + 2

Most recent update

After post-processing, updated has dtype object, which is not suitable for ordering directly

post_df['updated'].dtype
dtype('O')

We order the entries by updated date (ascending order) and then group by id. Notice that, as a step prior to ordering, the maximum of the elements in each updated entry is computed.

grouped = post_df.sort_values('updated', key=np.vectorize(max)).groupby('id')
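To make the role of the key argument concrete, here is a small self-contained sketch with toy data (integers stand in for the timestamps held in updated): every cell is a list, and rows are ordered by the largest element of each list before grouping.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'id': ['a', 'b', 'a'],
    'updated': [[3], [1, 5], [2]],
})

# the key maps every list to its maximum, and rows are sorted by that value
ordered = toy_df.sort_values('updated', key=np.vectorize(max))

# within each group rows keep the sorted order, so last() is the most recent
latest = ordered.groupby('id').last()
```

Because groupby preserves the order of rows inside each group, sorting first is what makes the last element of a group the most recent one.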

We are interested in groups with more than one element

not_one_element_group = (grouped.size() > 1)
not_one_element_group
id
https://contrataciondelestado.es/sindicacion/P...    False
https://contrataciondelestado.es/sindicacion/P...    False
https://contrataciondelestado.es/sindicacion/P...    False
https://contrataciondelestado.es/sindicacion/P...    False
https://contrataciondelestado.es/sindicacion/P...    False
                                                     ...  
https://contrataciondelestado.es/sindicacion/P...    False
https://contrataciondelestado.es/sindicacion/P...    False
https://contrataciondelestado.es/sindicacion/P...    False
https://contrataciondelestado.es/sindicacion/P...    False
https://contrataciondelestado.es/sindicacion/P...    False
Length: 115, dtype: bool

Number of tenders with updates

not_one_element_group.sum()
2
actual_groups = not_one_element_group.index[not_one_element_group]
actual_groups
Index(['https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8904280', 'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8994209'], dtype='string', name='id')

The largest group has this number of elements

grouped.size().max()
2

For getting groups with exactly a certain number of elements (here 2 again)

size_2_groups = (grouped.size() == 2).loc[lambda x: x].index
size_2_groups
Index(['https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8904280', 'https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/8994209'], dtype='string', name='id')
first_group = grouped.get_group(actual_groups[0])
first_group
id summary title updated ContractFolderStatus - ContractFolderID ContractFolderStatus - ContractFolderStatusCode ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - ProcurementProject - Name ... ContractFolderStatus - TechnicalDocumentReference - ID ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod
32 https://contrataciondelestado.es/sindicacion/P... Id licitación: 3069_1/2021; Órgano de Contrata... Construcción de depósito regulador de 100m3 en... [2022-01-01 00:00:16.761000+00:00] 3069_1/2021 [EV] <NA> CONCEJO DE GALBARRA Concejo de Galbarra Construcción de depósito regulador de 100m3 en... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT
19 https://contrataciondelestado.es/sindicacion/P... Id licitación: 3069_1/2021; Órgano de Contrata... Construcción de depósito regulador de 100m3 en... [2022-01-01 07:30:09.698000+00:00] 3069_1/2021 [EV] <NA> CONCEJO DE GALBARRA Concejo de Galbarra Construcción de depósito regulador de 100m3 en... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT

2 rows × 39 columns

True for the columns whose values differ between the first and last elements of the group

columns_are_different = first_group.iloc[0] != first_group.iloc[-1]
columns_are_different
id                                                                                                                           False
summary                                                                                                                      False
title                                                                                                                        False
updated                                                                                                                       True
ContractFolderStatus - ContractFolderID                                                                                      False
ContractFolderStatus - ContractFolderStatusCode                                                                              False
ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID                                                            True
ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name                                                    False
ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name                                       False
ContractFolderStatus - ProcurementProject - Name                                                                             False
ContractFolderStatus - ProcurementProject - TypeCode                                                                         False
ContractFolderStatus - ProcurementProject - BudgetAmount - EstimatedOverallContractAmount                                    False
ContractFolderStatus - ProcurementProject - BudgetAmount - TaxExclusiveAmount                                                False
ContractFolderStatus - ProcurementProject - RequiredCommodityClassification - ItemClassificationCode                         False
ContractFolderStatus - ProcurementProject - RealizedLocation - CountrySubentityCode                                          False
ContractFolderStatus - ProcurementProject - PlannedPeriod - DurationMeasure                                                  False
ContractFolderStatus - TenderResult - ResultCode                                                                              True
ContractFolderStatus - TenderResult - ReceivedTenderQuantity                                                                  True
ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID                                                 True
ContractFolderStatus - TenderResult - WinningParty - PartyName - Name                                                         True
ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount                        True
ContractFolderStatus - TenderingProcess - ProcedureCode                                                                      False
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndDate                                            True
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndTime                                            True
ContractFolderStatus - ValidNoticeInfo - NoticeTypeCode                                                                      False
ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - PublicationMediaName                                  False
ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate     True
ContractFolderStatus - LegalDocumentReference - ID                                                                            True
ContractFolderStatus - LegalDocumentReference - Attachment - ExternalReference - URI                                          True
ContractFolderStatus - TechnicalDocumentReference - ID                                                                        True
ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI                                      True
ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate                                                         True
ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate                                                           True
ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID                                             True
ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name                   True
ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate                                       True
ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime                                       True
ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID                                        True
ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod                                                      True
dtype: bool

The values in those columns are (besides updated) all missing. They are flagged as different only because missing values never compare equal to each other.

first_group[first_group.columns[columns_are_different]]
updated ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID ContractFolderStatus - TenderResult - ResultCode ContractFolderStatus - TenderResult - ReceivedTenderQuantity ContractFolderStatus - TenderResult - WinningParty - PartyIdentification - ID ContractFolderStatus - TenderResult - WinningParty - PartyName - Name ContractFolderStatus - TenderResult - AwardedTenderedProject - LegalMonetaryTotal - TaxExclusiveAmount ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndDate ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod - EndTime ContractFolderStatus - ValidNoticeInfo - AdditionalPublicationStatus - AdditionalPublicationDocumentReference - IssueDate ... ContractFolderStatus - TechnicalDocumentReference - ID ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod
32 [2022-01-01 00:00:16.761000+00:00] <NA> NaN NaN NaN NaN NaN <NA> <NA> [[2021-12-16, 2021-12-16, 2021-12-16, 2021-12-... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT
19 [2022-01-01 07:30:09.698000+00:00] <NA> NaN NaN NaN NaN NaN <NA> <NA> [[2021-12-16, 2021-12-16, 2021-12-16, 2021-12-... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT

2 rows × 22 columns
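The effect can be reproduced on toy data. The sketch below, which uses made-up values, shows that an elementwise != marks positions where both sides are NaN as different, and one way (an assumption, not something done in this notebook) of discounting them.

```python
import numpy as np
import pandas as pd

first = pd.Series({'a': 1.0, 'b': np.nan, 'c': 'x'})
last = pd.Series({'a': 1.0, 'b': np.nan, 'c': 'y'})

# NaN != NaN evaluates to True, so 'b' is reported as different
naive_difference = first != last

# discard the positions in which both values are missing
both_missing = first.isna() & last.isna()
actual_difference = naive_difference & ~both_missing
```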

Only the last element of each group (the most recent entry) is kept

only_last_update_df = grouped.last()
only_last_update_df.head()
summary title updated ContractFolderStatus - ContractFolderID ContractFolderStatus - ContractFolderStatusCode ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - ProcurementProject - Name ContractFolderStatus - ProcurementProject - TypeCode ... ContractFolderStatus - TechnicalDocumentReference - ID ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod
id
https://contrataciondelestado.es/sindicacion/P... Id licitación: 6011900114;Órgano de Contrataci... Servicio de migración de los productos BMC ins... [2021-12-31 10:56:15.855000+00:00] 6011900114 [RES] http://www.madrid.org/cs/Satellite?cid=1204201... Empresa Pública de Metro de Madrid, S.A. Consejería de Transportes e Infraestructuras Servicio de migración de los productos BMC ins... 2.0 ... 1354764548025.pdf http://www.madrid.org/contratos-publicos/13547... <NA> <NA> <NA> <NA> <NA> <NA> None NaT
https://contrataciondelestado.es/sindicacion/P... Id licitación: 11/2019; Ó“rgano de Contratació... Concurso de proyectos para la rehabilitación u... [2021-12-31 10:41:20.201000+00:00] 11/2019 [EV] <NA> Alcaldia Ayuntamiento de Leioa Concurso de proyectos para la rehabilitación u... 2.0 ... <NA> <NA> <NA> <NA> <NA> Ayuntamiento de Leioa <NA> <NA> None 2020-01-23 17:00:00+00:00
https://contrataciondelestado.es/sindicacion/P... Id licitación: 6012000208;Órgano de Contrataci... Suministro de equipos de manutención (carretil... [2021-12-31 11:11:15.851000+00:00] 6012000208 [RES] http://www.madrid.org/cs/Satellite?cid=1204201... Empresa Pública de Metro de Madrid, S.A. Consejería de Transportes e Infraestructuras Suministro de equipos de manutención (carretil... 1.0 ... 1354836373815.pdf http://www.madrid.org/contratos-publicos/13548... <NA> <NA> <NA> <NA> <NA> <NA> None NaT
https://contrataciondelestado.es/sindicacion/P... Id licitación: 6012100119;Órgano de Contrataci... Servicio de soporte al mantenimiento de instal... [2021-12-31 10:11:16.027000+00:00] 6012100119 [RES] http://www.madrid.org/cs/Satellite?cid=1204201... Empresa Pública de Metro de Madrid, S.A. Consejería de Transportes e Infraestructuras Servicio de soporte al mantenimiento de instal... 2.0 ... 1354874714036.pdf http://www.madrid.org/contratos-publicos/13548... <NA> <NA> <NA> <NA> <NA> <NA> None NaT
https://contrataciondelestado.es/sindicacion/P... Id Licitación: PcPG/2021/801758, Órgano de Con... Servicio para la caracterización hidromorfolóx... [2022-01-02 22:11:15.096000+00:00] PcPG/2021/801758 [RES] https://www.contratosdegalicia.gal//consultaOr... Ente Público Empresarial Augas de Galicia <NA> Servicio para la caracterización hidromorfolóx... 2.0 ... <NA> <NA> <NA> <NA> A12024974 <NA> <NA> <NA> None 2021-06-03 14:00:00+00:00

5 rows × 38 columns

The first group with more than one element

a_group_df = grouped.get_group(size_2_groups[0])
a_group_df
id summary title updated ContractFolderStatus - ContractFolderID ContractFolderStatus - ContractFolderStatusCode ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - ProcurementProject - Name ... ContractFolderStatus - TechnicalDocumentReference - ID ContractFolderStatus - TechnicalDocumentReference - Attachment - ExternalReference - URI ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod
32 https://contrataciondelestado.es/sindicacion/P... Id licitación: 3069_1/2021; Órgano de Contrata... Construcción de depósito regulador de 100m3 en... [2022-01-01 00:00:16.761000+00:00] 3069_1/2021 [EV] <NA> CONCEJO DE GALBARRA Concejo de Galbarra Construcción de depósito regulador de 100m3 en... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT
19 https://contrataciondelestado.es/sindicacion/P... Id licitación: 3069_1/2021; Órgano de Contrata... Construcción de depósito regulador de 100m3 en... [2022-01-01 07:30:09.698000+00:00] 3069_1/2021 [EV] <NA> CONCEJO DE GALBARRA Concejo de Galbarra Construcción de depósito regulador de 100m3 en... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT

2 rows × 39 columns

The columns whose history is to be kept

For the sake of efficiency

A list (of lists) with the columns that must be kept when keeping only updates

A function which keeps, for every procurement entry, only the last update.



keep_updates_only

 keep_updates_only (df:pandas.core.frame.DataFrame)

Keep only the last update for every collection of entries with the same id

Type Details
df DataFrame Input
Returns DataFrame Output
only_updated_df = keep_updates_only(post_df)
only_updated_df.head()
id index summary title ContractFolderStatus - ContractFolderID ContractFolderStatus - LocatedContractingParty - BuyerProfileURIID ContractFolderStatus - LocatedContractingParty - Party - PartyName - Name ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - ProcurementProject - Name ContractFolderStatus - ProcurementProject - TypeCode ... ContractFolderStatus - ProcurementProject - PlannedPeriod - StartDate ContractFolderStatus - ProcurementProject - PlannedPeriod - EndDate ContractFolderStatus - LocatedContractingParty - Party - PartyIdentification - ID ContractFolderStatus - LocatedContractingParty - ParentLocatedParty - ParentLocatedParty - PartyName - Name ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndDate ContractFolderStatus - TenderingProcess - ParticipationRequestReceptionPeriod - EndTime ContractFolderStatus - TenderResult - AwardedTenderedProject - ProcurementProjectLotID ContractFolderStatus - TenderingProcess - TenderSubmissionDeadlinePeriod updated ContractFolderStatus - ContractFolderStatusCode
0 https://contrataciondelestado.es/sindicacion/P... 116 Id licitación: 0001264/2021; Órgano de contrat... 2021/pa-44-4 servicio de mantenimiento integra... 0001264/2021 <NA> Agencia Pública Empresarial Sanitaria Bajo Gua... Junta de Andalucía 2021/pa-44-4 servicio de mantenimiento integra... 2.0 ... <NA> <NA> <NA> <NA> <NA> <NA> NaN 2021-11-11 15:00:00+00:00 [2021-12-30 23:14:15.739000+00:00] [RES]
1 https://contrataciondelestado.es/sindicacion/P... 115 Id licitación: 8113_01 2021; Órgano de Contrat... Contrato del Servicio de Teleasistencia para l... 8113_01 2021 <NA> Agencia Navarra de Autonomía y Desarrollo de l... Agencia Navarra para la Dependencia Contrato del Servicio de Teleasistencia para l... 2.0 ... <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT [2021-12-31 00:00:12.604000+00:00] [EV]
2 https://contrataciondelestado.es/sindicacion/P... 114 Id licitación: 8113_3/2021; Órgano de Contrata... Contrato de servicios de desinfección, desinse... 8113_3/2021 <NA> Subdirector de Gestión y Recursos Agencia Navarra para la Dependencia Contrato de servicios de desinfección, desinse... 2.0 ... 2022-01-01 2022-12-31 <NA> <NA> <NA> <NA> NaN NaT [2021-12-31 00:00:13.594000+00:00] [EV]
3 https://contrataciondelestado.es/sindicacion/P... 113 Id licitación: 8165_3/2021; Órgano de Contrata... Asistencia técnica para la prestación del serv... 8165_3/2021 <NA> Mancomunidad de Servicios Sociales de Base de ... MANCOMUNIDAD DE SERVICIOS DE HUARTE Y DE ESTER... Asistencia técnica para la prestación del serv... 2.0 ... <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT [2021-12-31 00:00:14.393000+00:00] [EV]
4 https://contrataciondelestado.es/sindicacion/P... 112 Id licitación: 1005_391-2021; Órgano de Contra... Apoyo a la gestión del patrimonio filmográfico... 1005_391-2021 <NA> Dirección General de Cultura-Institución Prínc... Departamento de Cultura, Deporte y Juventud Apoyo a la gestión del patrimonio filmográfico... 2.0 ... <NA> <NA> <NA> <NA> <NA> <NA> NaN 2022-01-14 23:59:00+00:00 [2021-12-31 00:00:14.946000+00:00] [PUB]

5 rows × 40 columns

len(only_updated_df)
115
only_updated_df.index
RangeIndex(start=0, stop=115, step=1)

One of the groups (now an individual row) that had more than one row

only_updated_df[only_updated_df['id'] == size_2_groups[0]][assembled_historical_cols]
updated ContractFolderStatus - ContractFolderStatusCode
95 [2022-01-01 00:00:16.761000+00:00, 2022-01-01 ... [EV, EV]
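The body of keep_updates_only is not shown on this page. The output above suggests that the last update is kept for most columns while the history of a few list-valued ones is concatenated; a minimal sketch of that idea, under that assumption and with the hypothetical name keep_last_with_history, is given below.

```python
import pandas as pd


def keep_last_with_history(df, historical_cols, id_col='id', order_col='updated'):
    """Keep the most recent row per id, concatenating the lists in historical_cols."""
    # order rows by the largest element in each (list-valued) order_col cell
    ordered = df.sort_values(order_col, key=lambda s: s.map(max))
    grouped = ordered.groupby(id_col, sort=False)
    # most recent value of every column
    latest = grouped.last()
    # for the historical columns, chain the lists of all the rows in the group
    for col in historical_cols:
        latest[col] = grouped[col].agg(
            lambda s: [value for values in s for value in values])
    return latest.reset_index()


toy_df = pd.DataFrame({
    'id': ['a', 'b', 'a'],
    'updated': [[3], [1], [2]],
    'status': [['PUB'], ['EV'], ['EV']],
    'title': ['new', 'only', 'old'],
})
result = keep_last_with_history(toy_df, ['updated', 'status'])
row_a = result[result['id'] == 'a'].iloc[0]
```

In the toy data, integers stand in for timestamps; the merged row for id 'a' carries both update times and both status codes, in chronological order.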

Deleted series

import sproc.bundle
zip_file = directory / 'yearly' / 'PlataformasAgregadasSinMenores_2018.zip'
assert zip_file.exists()
zip_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/yearly/PlataformasAgregadasSinMenores_2018.zip')
deleted_series = sproc.bundle.read_deleted_zip(zip_file)
deleted_series.head(2)
zip                                      file name                                              id                                                                                 
PlataformasAgregadasSinMenores_2018.zip  PlataformasAgregadasSinMenores_20180217_180137_1.atom  https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1985903   2018-01-04 13:11:18.021000+00:00
                                                                                                https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1969197   2018-01-04 13:11:17.921000+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]

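File names (level 1 of the index) are duplicated, since a single file can list several deleted entries; the ids are at level 2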

deleted_series.index.get_level_values(1).duplicated().any()
True

We add an artificial one

new_data = pd.Timestamp(year=2017, month=1, day=3, tz=deleted_series.iloc[-1].tz)
deleted_series.loc['foo', deleted_series.index[-1][1], deleted_series.index[-1][2]] = new_data

Now there is a duplicate (at level 2, i.e., id)

deleted_series.index.get_level_values(2).duplicated().any()
True

A function to get rid of duplicates from a deleted series. The oldest entry is kept.



deduplicate_deleted_series

 deduplicate_deleted_series (series:pandas.core.series.Series)
deduplicated_series = deduplicate_deleted_series(deleted_series)
deduplicated_series.tail(2)
zip                                      file name                                              id                                                                                 
PlataformasAgregadasSinMenores_2018.zip  PlataformasAgregadasSinMenores_20180217_190110_1.atom  https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/1996094   2018-02-07 09:01:16.378000+00:00
foo                                      PlataformasAgregadasSinMenores_20180217_190110_1.atom  https://contrataciondelestado.es/sindicacion/PlataformasAgregadasSinMenores/2000163          2017-01-03 00:00:00+00:00
Name: deleted_on, dtype: datetime64[ns, UTC]
assert not deduplicated_series.index.get_level_values(2).duplicated().any()
deleted_series.shape, deduplicated_series.shape
((52,), (51,))
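The source of deduplicate_deleted_series is not included here. A sketch of one way to obtain the behaviour described (one entry per id, keeping the oldest date) follows; it assumes a series indexed by zip, file name and id, as above, and the function name deduplicate_by_id and the toy values are hypothetical.

```python
import pandas as pd


def deduplicate_by_id(series: pd.Series, id_level: str = 'id') -> pd.Series:
    """Keep, for every id, only the entry with the oldest timestamp."""
    # oldest entries first, so that the first occurrence of an id is the oldest
    ordered = series.sort_values()
    is_repeated = ordered.index.get_level_values(id_level).duplicated(keep='first')
    kept_index = ordered[~is_repeated].index
    # select from the original series in order to preserve its ordering
    return series[series.index.isin(kept_index)]


index = pd.MultiIndex.from_tuples(
    [('z1', 'f1', 'id1'), ('z1', 'f1', 'id2'), ('foo', 'f1', 'id2')],
    names=['zip', 'file name', 'id'])
toy_series = pd.Series(
    pd.to_datetime(['2018-01-04', '2018-02-07', '2017-01-03'], utc=True),
    index=index, name='deleted_on')
deduplicated = deduplicate_by_id(toy_series)
```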