Directory where the data (XML files) are stored
directory = pathlib.Path.cwd().parent / 'samples'
assert directory.exists()
directory
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples')
A (sample) file in that directory
xml_file = directory / 'PlataformasAgregadasSinMenores_20220104_030016_1.atom'
assert xml_file.exists()
xml_file
PosixPath('/home/manu/Sync/UC3M/proyectos/2022/nextProcurement/sproc/samples/PlataformasAgregadasSinMenores_20220104_030016_1.atom')
The XML file is read into a hierarchical (column-MultiIndex) pd.DataFrame. CAVEAT: a hierarchical pd.DataFrame is assumed.
id | summary | title | updated | ContractFolderStatus | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ContractFolderID | ContractFolderStatusCode | LocatedContractingParty | ProcurementProject | ... | TechnicalDocumentReference | ProcurementProject | LocatedContractingParty | TenderingProcess | TenderResult | TenderingProcess | |||||||||||
BuyerProfileURIID | Party | ParentLocatedParty | Name | ... | ID | Attachment | PlannedPeriod | Party | ParentLocatedParty | ParticipationRequestReceptionPeriod | AwardedTenderedProject | TenderSubmissionDeadlinePeriod | |||||||||
PartyName | PartyName | ... | ExternalReference | StartDate | EndDate | PartyIdentification | ParentLocatedParty | EndDate | EndTime | ProcurementProjectLotID | |||||||||||
Name | Name | ... | URI | ID | PartyName | ||||||||||||||||
... | Name | ||||||||||||||||||||
0 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: C. 2-2021; Órgano de Contrataci... | L'objecte del contracte és la renovació de tot... | 2022-01-03 00:11:41.826000+00:00 | C. 2-2021 | ADJ | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Sant Ramon | Entitats municipals de Catalunya | L'objecte del contracte és la renovació de tot... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2021-12-17 14:00:00+00:00 |
1 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 8128_3/2021; Órgano de Contrata... | Obras de restauración hidromorfológica del río... | 2022-01-03 00:00:11.194000+00:00 | 8128_3/2021 | PUB | <NA> | Pleno del Ayuntamiento | AYUNTAMIENTO DE MONREAL | Obras de restauración hidromorfológica del río... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-22 23:30:00+00:00 |
2 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1000_0005-CP01-2021-000063; Órg... | Contrato del servicio de realización de labore... | 2022-01-03 00:00:10.399000+00:00 | 1000_0005-CP01-2021-000063 | EV | <NA> | El Director General de Comunicación y Relacion... | Departamento de Presidencia, Igualdad, Función... | Contrato del servicio de realización de labore... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | NaT |
3 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 1379/2020 4738; Órgano de Contr... | Obres de renovació de l'enllumenat públic a la... | 2022-01-02 23:11:40.740000+00:00 | 1379/2020 4738 | EV | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Canet de Mar | Entitats municipals de Catalunya | Obres de renovació de l'enllumenat públic a la... | ... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
4 | https://contrataciondelestado.es/sindicacion/P... | Id licitación: 2021-44; Órgano de Contratación... | Subministre i la instal·lació fotovoltaica en ... | 2022-01-02 23:11:40.696000+00:00 | 2021-44 | EV | https://contractaciopublica.gencat.cat/ecofin_... | Ajuntament de Valls | Entitats municipals de Catalunya | Subministre i la instal·lació fotovoltaica en ... | ... | Enllac plec clausules tecniques.doc | https://contractaciopublica.gencat.cat/ecofin_... | <NA> | <NA> | <NA> | <NA> | <NA> | <NA> | NaN | 2022-01-02 23:59:00+00:00 |
5 rows × 39 columns
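The columns used to discriminate regions
Each column path is turned into a full-depth column key, padded with empty strings up to the number of levels of the column index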
domain_discriminative_columns = [sproc.hier.pad_col_levels(df, p) for p in domain_discriminative_columns_paths]
domain_discriminative_columns
[('ContractFolderStatus',
'LocatedContractingParty',
'BuyerProfileURIID',
'',
'',
''),
('ContractFolderStatus',
'LegalDocumentReference',
'Attachment',
'ExternalReference',
'URI',
'')]
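The padded tuples above have one entry per level of the column index, with empty strings filling the unused levels. The implementation of `sproc.hier.pad_col_levels` is not shown on this page; the following is only a minimal sketch of the idea, in which the `pad_path` helper and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def pad_path(df: pd.DataFrame, path: list) -> tuple:
    """Hypothetical helper: pad `path` with '' up to the number of column levels."""
    n_levels = df.columns.nlevels
    if len(path) > n_levels:
        raise ValueError('path is deeper than the column index')
    return tuple(path) + ('',) * (n_levels - len(path))


# toy frame with a 3-level column index (not the real data)
toy = pd.DataFrame(
    [['some-id', 'C. 2-2021']],
    columns=pd.MultiIndex.from_tuples([
        ('id', '', ''),
        ('ContractFolderStatus', 'ContractFolderID', ''),
    ]),
)

padded = pad_path(toy, ['ContractFolderStatus', 'ContractFolderID'])
print(padded)

# a full-depth key selects exactly one column, i.e. a pd.Series
values = toy[padded].tolist()
print(values)
```

A full-depth key is convenient because indexing with it always yields a single column rather than a sub-frame.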
Unique values of the first one
['https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
<NA>,
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=29178875',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=8530338',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=16054009']
pd.NA values are filtered out
['https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=29178875',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=8530338',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=16054009',
'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=15937468']
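The cell that produced the list above is not shown. One way to obtain unique, non-null values from a column is sketched below on a toy `pd.Series`; the `example.org` URLs are made up for illustration.

```python
import pandas as pd

# toy column with a missing value and a repeated URL (not the real data)
urls = pd.Series(
    ['https://example.org/a', pd.NA, 'https://example.org/b', 'https://example.org/a'],
    dtype='string',
)

# drop the missing entries first, then keep each remaining value once
unique_urls = list(urls.dropna().unique())
print(unique_urls)
```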
A one-liner returning a pd.DataFrame of columns with parsed domains
domains = df[domain_discriminative_columns].applymap(lambda x: urllib.parse.urlparse(x).netloc if pd.notna(x) else pd.NA)
domains
ContractFolderStatus | ||
---|---|---|
LocatedContractingParty | LegalDocumentReference | |
BuyerProfileURIID | Attachment | |
ExternalReference | ||
URI | ||
0 | contractaciopublica.gencat.cat | <NA> |
1 | <NA> | <NA> |
2 | <NA> | <NA> |
3 | contractaciopublica.gencat.cat | contractaciopublica.gencat.cat |
4 | contractaciopublica.gencat.cat | contractaciopublica.gencat.cat |
... | ... | ... |
112 | NaN | NaN |
113 | NaN | NaN |
114 | NaN | NaN |
115 | NaN | NaN |
116 | NaN | NaN |
117 rows × 2 columns
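The one-liner relies on `urllib.parse.urlparse(...).netloc`, which keeps only the host part of a URL. In pandas 2.1 and later `DataFrame.applymap` is deprecated in favour of `DataFrame.map`, so a version-independent variant may be useful. The sketch below maps each column separately with `Series.map`; the `netloc_or_na` helper and the toy frame are assumptions, not part of the library.

```python
import urllib.parse

import pandas as pd


def netloc_or_na(x):
    """Return the network location of a URL, or pd.NA for a missing value."""
    return urllib.parse.urlparse(x).netloc if pd.notna(x) else pd.NA


# toy frame with two URL columns (column names are made up)
toy = pd.DataFrame({
    'profile': [
        'https://contractaciopublica.gencat.cat/ecofin_pscp/AppJava/cap.pscp?reqCode=viewDetail&idCap=2763318',
        pd.NA,
    ],
    'document': [pd.NA, 'https://example.org/some/document.pdf'],
})

# apply the element-wise parser column by column
toy_domains = toy.apply(lambda column: column.map(netloc_or_na))
print(toy_domains)
```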
How many non-null values are there in each column?
ContractFolderStatus LocatedContractingParty BuyerProfileURIID 42
LegalDocumentReference Attachment ExternalReference URI 70
dtype: int64
What about the combination?
domains[domain_discriminative_columns[0]].combine_first(domains[domain_discriminative_columns[1]]).notna().sum()
77
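The combined count (77) is larger than either individual count but smaller than their sum (42 + 70), because many rows have a value in both columns. `combine_first` keeps the value of the calling series and only falls back to the other one where the former is null. A toy illustration with made-up domains:

```python
import pandas as pd

first = pd.Series(['a.example', pd.NA, pd.NA, pd.NA])
second = pd.Series(['b.example', 'b.example', pd.NA, 'c.example'])

# values of `first` win; `second` only fills the gaps
combined = first.combine_first(second)
print(combined)

counts = (
    int(first.notna().sum()),
    int(second.notna().sum()),
    int(combined.notna().sum()),
)
print(counts)
```

Here row 0 is filled in both series, so the combined count is 3 rather than 1 + 3.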
A function returning a pd.Series with domains
domain (df:pandas.core.frame.DataFrame)
Extract the (internet) domains from the given data
| | Type | Details |
|---|---|---|
| df | DataFrame | Input |
| Returns | Series | Domains |
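The body of `domain` is not reproduced on this page. Under the assumption that it parses every discriminative column and keeps the first non-null domain per row, it could look roughly like the sketch below. `domain_sketch` is a hypothetical stand-in which, unlike the documented signature, takes the column keys as an explicit argument.

```python
import functools
import urllib.parse

import pandas as pd


def domain_sketch(df: pd.DataFrame, columns: list) -> pd.Series:
    """Hypothetical stand-in for `domain`: first non-null domain across `columns`."""

    def netloc(x):
        return urllib.parse.urlparse(x).netloc if pd.notna(x) else pd.NA

    parsed = [df[c].map(netloc) for c in columns]
    # earlier columns take precedence; later ones only fill the gaps
    return functools.reduce(lambda acc, s: acc.combine_first(s), parsed)


# toy data with made-up URLs
toy = pd.DataFrame({
    'profile': ['https://a.example/x', pd.NA, pd.NA],
    'document': ['https://b.example/y', 'https://b.example/z', pd.NA],
})

result = domain_sketch(toy, ['profile', 'document'])
print(result)
```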
The function adds a new column to the pd.DataFrame in place
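Modifying in place means that the caller's frame itself gains the column and nothing is returned. A minimal sketch of that pattern on a frame with hierarchical columns follows; the `add_domain_column` helper and the `'domain'` column name are assumptions for illustration only.

```python
import pandas as pd


def add_domain_column(df: pd.DataFrame, domains: pd.Series) -> None:
    """Hypothetical helper: store `domains` in `df` itself under a padded key."""
    key = ('domain',) + ('',) * (df.columns.nlevels - 1)
    df[key] = domains  # mutates the caller's frame; nothing is returned


# toy frame with a 2-level column index (not the real data)
toy = pd.DataFrame(
    [['C. 2-2021'], ['8128_3/2021']],
    columns=pd.MultiIndex.from_tuples([('ContractFolderStatus', 'ContractFolderID')]),
)

returned = add_domain_column(toy, pd.Series(['a.example', pd.NA]))
print(toy.columns.tolist())
```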
A function to parse a string containing either a year or a year and a month
year_and_maybe_month (s:str)
| | Type | Details |
|---|---|---|
| s | str | Raw date |
| Returns | datetime | Parsed date |
Only a year
A year and a month
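The example cells for both cases are missing here, and the exact input format accepted by `year_and_maybe_month` is not stated. Assuming the raw date is either `YYYY` or `YYYYMM`, a parser of this kind might be sketched as follows; `parse_year_and_maybe_month` is a hypothetical name, not the library function.

```python
import datetime


def parse_year_and_maybe_month(s: str) -> datetime.datetime:
    """Sketch: parse 'YYYY' or 'YYYYMM' (assumed input formats)."""
    # more than four characters means a month follows the year
    fmt = '%Y%m' if len(s) > 4 else '%Y'
    return datetime.datetime.strptime(s, fmt)


only_year = parse_year_and_maybe_month('2018')
year_and_month = parse_year_and_maybe_month('202201')
print(only_year, year_and_month)
```

Missing components default to their first value, so a bare year resolves to January 1st of that year.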