parse

Utilities for parsing procurement data: internet domains and dates

Sample data

Directory where the data (XML files) are stored

directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')

A (sample) file in that directory

xml_file = directory / 'PlataformasAgregadasSinMenores_20220104_030016_1.atom'
assert xml_file.exists()
xml_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_20220104_030016_1.atom')

The XML file is read into a pd.DataFrame with hierarchical (multiindex) columns. CAVEAT: everything below assumes such a hierarchical pd.DataFrame.

df = sproc.hier.flat_df_to_multiindexed_df(sproc.xml.to_curated_df(xml_file))
df.head()
id summary title updated ContractFolderStatus
ContractFolderID ContractFolderStatusCode LocatedContractingParty ProcurementProject ... TechnicalDocumentReference ProcurementProject LocatedContractingParty TenderingProcess TenderResult TenderingProcess
BuyerProfileURIID Party ParentLocatedParty Name ... ID Attachment PlannedPeriod Party ParentLocatedParty ParticipationRequestReceptionPeriod AwardedTenderedProject TenderSubmissionDeadlinePeriod
PartyName PartyName ... ExternalReference StartDate EndDate PartyIdentification ParentLocatedParty EndDate EndTime ProcurementProjectLotID
Name Name ... URI ID PartyName
... Name
0 https://contrataciondelestado.es/sindicacion/P... Id licitación: C. 2-2021; Órgano de Contrataci... L'objecte del contracte és la renovació de tot... 2022-01-03 00:11:41.826000+00:00 C. 2-2021 ADJ https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Sant Ramon Entitats municipals de Catalunya L'objecte del contracte és la renovació de tot... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 2021-12-17 14:00:00+00:00
1 https://contrataciondelestado.es/sindicacion/P... Id licitación: 8128_3/2021; Órgano de Contrata... Obras de restauración hidromorfológica del río... 2022-01-03 00:00:11.194000+00:00 8128_3/2021 PUB <NA> Pleno del Ayuntamiento AYUNTAMIENTO DE MONREAL Obras de restauración hidromorfológica del río... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 2022-01-22 23:30:00+00:00
2 https://contrataciondelestado.es/sindicacion/P... Id licitación: 1000_0005-CP01-2021-000063; Órg... Contrato del servicio de realización de labore... 2022-01-03 00:00:10.399000+00:00 1000_0005-CP01-2021-000063 EV <NA> El Director General de Comunicación y Relacion... Departamento de Presidencia, Igualdad, Función... Contrato del servicio de realización de labore... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN NaT
3 https://contrataciondelestado.es/sindicacion/P... Id licitación: 1379/2020 4738; Órgano de Contr... Obres de renovació de l'enllumenat públic a la... 2022-01-02 23:11:40.740000+00:00 1379/2020 4738 EV https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Canet de Mar Entitats municipals de Catalunya Obres de renovació de l'enllumenat públic a la... ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NaN 2022-01-02 23:59:00+00:00
4 https://contrataciondelestado.es/sindicacion/P... Id licitación: 2021-44; Órgano de Contratación... Subministre i la instal·lació fotovoltaica en ... 2022-01-02 23:11:40.696000+00:00 2021-44 EV https://contractaciopublica.gencat.cat/ecofin_... Ajuntament de Valls Entitats municipals de Catalunya Subministre i la instal·lació fotovoltaica en ... ... Enllac plec clausules tecniques.doc https://contractaciopublica.gencat.cat/ecofin_... <NA> <NA> <NA> <NA> <NA> <NA> NaN 2022-01-02 23:59:00+00:00

5 rows × 39 columns

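The columns used to discriminate regions

Each column path in domain_discriminative_columns_paths is padded with empty strings so that it matches the depth of the column multiindex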

domain_discriminative_columns = [sproc.hier.pad_col_levels(df, p) for p in domain_discriminative_columns_paths]
domain_discriminative_columns
[('ContractFolderStatus',
  'LocatedContractingParty',
  'BuyerProfileURIID',
  '',
  '',
  ''),
 ('ContractFolderStatus',
  'LegalDocumentReference',
  'Attachment',
  'ExternalReference',
  'URI',
  '')]
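To make the padding step concrete, here is a minimal sketch on a toy pd.DataFrame. The helper pad_path and the toy columns are hypothetical, written only for illustration; this is not the implementation of sproc.hier.pad_col_levels, just one way to get the same kind of padded tuples shown above.

```python
import pandas as pd


def pad_path(df: pd.DataFrame, path: tuple) -> tuple:
    """Pad `path` with empty strings up to the number of column levels in `df`.

    Hypothetical helper for illustration; not part of sproc.
    """
    n_levels = df.columns.nlevels
    if len(path) > n_levels:
        raise ValueError(f'path has {len(path)} levels but the columns only have {n_levels}')
    return tuple(path) + ('',) * (n_levels - len(path))


# A toy frame with 3-level columns; shorter names are already padded with ''
columns = pd.MultiIndex.from_tuples([
    ('id', '', ''),
    ('ContractFolderStatus', 'ContractFolderID', ''),
    ('ContractFolderStatus', 'LocatedContractingParty', 'BuyerProfileURIID'),
])
toy = pd.DataFrame([['a', 'b', 'c']], columns=columns)

padded = pad_path(toy, ('ContractFolderStatus', 'ContractFolderID'))
print(padded)

# The full-depth tuple selects exactly one column
print(toy[padded].tolist())
```

A full-depth tuple is convenient because indexing with it yields a single pd.Series rather than a sub-frame.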

Unique values of the first of these columns

unique_values = df[domain_discriminative_columns[0]].unique().tolist()
unique_values[:5]
['https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
 <NA>,
 'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=29178875',
 'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=8530338',
 'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=16054009']

pd.NA values are filtered out

unique_values = list(filter(pd.notna, unique_values))
unique_values[:5]
['https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
 'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=29178875',
 'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=8530338',
 'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=16054009',
 'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=15937468']

A one-liner returning a pd.DataFrame of columns with parsed domains

domains = df[domain_discriminative_columns].applymap(lambda x: urllib.parse.urlparse(x).netloc if pd.notna(x) else pd.NA)
domains
ContractFolderStatus
LocatedContractingParty LegalDocumentReference
BuyerProfileURIID Attachment
ExternalReference
URI
0 contractaciopublica.gencat.cat <NA>
1 <NA> <NA>
2 <NA> <NA>
3 contractaciopublica.gencat.cat contractaciopublica.gencat.cat
4 contractaciopublica.gencat.cat contractaciopublica.gencat.cat
... ... ...
112 NaN NaN
113 NaN NaN
114 NaN NaN
115 NaN NaN
116 NaN NaN

117 rows × 2 columns
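The same idea can be tried on a small, self-contained pd.DataFrame. The column names and the helper netloc_or_na below are made up for illustration. Note that pandas 2.1 deprecated DataFrame.applymap in favor of DataFrame.map; the sketch applies Series.map column by column, which works before and after that change.

```python
import urllib.parse

import pandas as pd

# Toy data: one URL (taken from the values listed above) and some missing entries
toy = pd.DataFrame({
    'buyer_profile': [
        'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
        pd.NA,
    ],
    'legal_document': [pd.NA, pd.NA],
})


def netloc_or_na(value):
    """Return the network location of a URL, or pd.NA if the value is missing."""
    return urllib.parse.urlparse(value).netloc if pd.notna(value) else pd.NA


# Element-wise parsing, one column at a time
toy_domains = toy.apply(lambda column: column.map(netloc_or_na))
print(toy_domains)
```

Guarding with pd.notna matters here: urllib.parse.urlparse expects a string, so missing values have to be passed through untouched.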

How many non-nulls are in every column?

domains.notna().sum()
ContractFolderStatus  LocatedContractingParty  BuyerProfileURIID                              42
                      LegalDocumentReference   Attachment         ExternalReference  URI      70
dtype: int64

How many rows have a domain when both columns are combined (the first one taking precedence)?

domains[domain_discriminative_columns[0]].combine_first(domains[domain_discriminative_columns[1]]).notna().sum()
77
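The combined count (77) is smaller than the sum of the individual counts (42 + 70) because rows with a value in both columns are only counted once. A toy example with made-up values shows how combine_first behaves: the first pd.Series wins wherever it has a value, and the second one only fills its gaps.

```python
import pandas as pd

# Hypothetical domains: row 2 has a value in both series, row 3 in neither
first = pd.Series(['a.example', pd.NA, 'c.example', pd.NA], dtype='object')
second = pd.Series([pd.NA, 'b.example', 'other.example', pd.NA], dtype='object')

combined = first.combine_first(second)
print(combined)

# Non-null counts: each series alone, and the combination
print(first.notna().sum(), second.notna().sum(), combined.notna().sum())
```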

A function returning a pd.Series with domains


domain

 domain (df:pandas.core.frame.DataFrame)

Extract the (internet) domains from the given data

Type Details
df DataFrame Input
Returns Series Domains

The function returns a pd.Series with one domain per row (pd.NA where none could be extracted)

domain(df)
0      contractaciopublica.gencat.cat
1                                <NA>
2                                <NA>
3      contractaciopublica.gencat.cat
4      contractaciopublica.gencat.cat
                    ...              
112                              <NA>
113                              <NA>
114                              <NA>
115                              <NA>
116                              <NA>
Name: (ContractFolderStatus, LocatedContractingParty, BuyerProfileURIID, , , ), Length: 117, dtype: object

Dates

A function to parse a string containing either a year or a year and a month


year_and_maybe_month

 year_and_maybe_month (s:str)
Type Details
s str Raw date
Returns datetime Parsed date

Only a year (the month in the result is December)

year_and_maybe_month('2021')
datetime.datetime(2021, 12, 1, 0, 0)
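The output above suggests that, when only a year is given, the month defaults to December. A minimal sketch of a parser with that behavior follows; the name year_and_maybe_month_sketch is hypothetical and the December default is an assumption inferred from that output, so this is not necessarily how the library function is written.

```python
import datetime


def year_and_maybe_month_sketch(s: str) -> datetime.datetime:
    """Parse 'YYYY' or 'YYYYMM' into a datetime on the first day of the month.

    Hypothetical re-implementation for illustration only.
    """
    s = s.strip()
    if len(s) == 4:
        # Assumption: a bare year is taken to mean December of that year
        s += '12'
    # Raises ValueError for anything that is not a valid 'YYYYMM' string
    return datetime.datetime.strptime(s, '%Y%m')


print(year_and_maybe_month_sketch('2019'))
print(year_and_maybe_month_sketch('201907'))
```

Relying on strptime with an explicit format means that malformed input fails loudly with a ValueError instead of being silently misread.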

A year and a month

year_and_maybe_month('202205')
datetime.datetime(2022, 5, 1, 0, 0)